Computing Nearest Covariance and Correlation Matrices


Computing Nearest Covariance and Correlation Matrices

Lucas, Craig

2001

MIMS EPrint: 2015.40

Manchester Institute for Mathematical Sciences
School of Mathematics
The University of Manchester

Reports available from the MIMS Secretary, School of Mathematics, The University of Manchester, Manchester, M13 9PL, UK

ISSN 1749-9097

Computing Nearest Covariance and Correlation Matrices

A thesis submitted to the University of Manchester for the degree of Master of Science in the Faculty of Science and Engineering.

October 2001

Craig Lucas
Department of Mathematics

Contents

Abstract
Declaration
Copyright and Intellectual Property Rights
Acknowledgements

1 Introduction
  1.1 Covariance and Correlation Matrices
  1.2 Application
  1.3 Properties
  1.4 Eigenvalues of a Correlation Matrix

2 Calculation of Covariance and Correlation Matrices
  2.1 Exact Sample Covariance and Correlation Matrices
  2.2 Approximate Sample Covariance and Correlation Matrices

3 Testing
  3.1 Test Data
  3.2 Test Machine

4 The Nearest Correlation Matrix Problem
  4.1 The Problem
  4.2 Alternating Projections
  4.3 Experiments
  4.4 Sequential Quadratic Programming
  4.5 Experiments
  4.6 Matrix Completions
  4.7 Conclusions

5 The Nearest Covariance Matrix Problem
  5.1 The Problem
  5.2 Multi-Directional Search Optimization
  5.3 Experiments
  5.4 Conclusions

6 Efficient Implementation
  6.1 MEX files
  6.2 LAPACK
  6.3 Some Timings

7 Concluding Remarks

Appendices

A MATLAB M-Files
  A.1 Computation of S and R M-files
  A.2 Alternating Projection M-Files
  A.3 M-Files for fmincon
  A.4 M-File Function for mdsmax

B MEX file for Partial Eigendecomposition

Abstract

We look at two matrix nearness problems posed by a finance company, where nearness is measured in the Frobenius norm. Correlation and covariance matrices are computed from sampled stock data with missing entries by a technique that produces matrices that are not positive semidefinite. In the first problem we find the nearest correlation matrix that is positive semidefinite and preserves any correlations known to be exact. In the second problem we investigate how the missing elements in the data should be chosen in order to generate the nearest covariance matrix to the indefinite matrix from the completed set of data. We show how the former problem can be solved using an alternating projections algorithm and how the latter problem can be investigated using a multi-directional search optimization method.

Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this university or other institute of learning.

Copyright and Intellectual Property Rights

Copyright in the text of this thesis rests with the Author. Copies (by any process) either in full, or of extracts, may be made only in accordance with instructions given by the Author and lodged in the John Rylands University Library of Manchester. Details may be obtained from the Librarian. This page must form part of any such copies made. Further copies (by any process) of copies made in accordance with such instructions may not be made without the permission (in writing) of the Author.

The ownership of any intellectual property rights which may be described in this thesis is vested in the University of Manchester, subject to any prior agreement to the contrary, and may not be made available for use by third parties without the written permission of the University, which will preserve the terms and conditions of any such agreement.

Further information on the conditions under which disclosures and exploitation may take place is available from the Head of the Department of Mathematics.

Acknowledgements

I would like to thank my supervisor, Nick Higham, for his guidance and numerous helpful suggestions throughout the preparation of this thesis.

Thanks also go to Glenn for his companionship and reminding me there is more to my life than mathematics.

1 Introduction

1.1 Covariance and Correlation Matrices

In statistics, a random variable, say $Y$, is a function defined over a sample space, $\Omega$, that has associated a real number, $Y(e) = y$, for each outcome, $e$, in $\Omega$. We call $y$ an observation. The sample space is the set of all possible outcomes of an experiment, the process of obtaining this outcome from some phenomenon. For example, a random variable could be 'The number of threes in an integer' if the sample space consists of all integers, and for the outcome 3003 we have $Y(3003) = 2$.

A sample is a subset of the sample space, and we say a random variable is sampled if we measure the values $y$ from this subset only.

We can measure the spread or dispersion of the sampled random variable, and call this quantity the sample variance. Similarly we can measure how two sampled random variables vary relative to each other and call this the sample covariance of the two random variables. Another measure between two random variables is their correlation coefficient. The correlation coefficient is a measure of the linear relation between the two variables. It takes values between $-1$ and $1$, where $1$ indicates perfect positive correlation and $-1$ perfect negative correlation. Perfect correlation arises if $x$ and $y$ are vectors of observations for two sampled random variables and $x = ky$, $k \in \mathbb{R}$. We give formal definitions of the sample covariance and sample correlation coefficient later.

If we have a collection of sampled random variables we can construct sample covariance matrices and sample correlation matrices. The $(i,j)$ element of a sample covariance matrix is the covariance of the $i$th and $j$th random variables. Similarly the $(i,j)$ element of a sample correlation matrix is the correlation coefficient of the $i$th and $j$th random variables.

1.2 Application

Correlation matrices get their name from the fact they contain correlation coefficients, but they also arise in non-statistical applications in numerical linear algebra. They are used in error analysis of Cholesky factorisation, in a preconditioner for iteration methods for solving symmetric positive definite linear systems and in error analysis of Jacobi methods for finding the eigensystem of a symmetric matrix. See [3] for more details.

Our application, however, is a statistical one, where sample covariance and correlation matrices are generated from stock data. We look at two problems posed by a finance company who use the matrices for analysis and prediction purposes.

1.3 Properties

Covariance and correlation matrices are symmetric and positive semidefinite (see Section 2.1). Correlation matrices have a unit diagonal since a variable is clearly perfectly correlated with itself.

It is well known that for a positive semidefinite matrix $A$

$$|a_{ij}| \le \sqrt{a_{ii}\,a_{jj}}.$$

Thus for correlation matrices we have

$$|a_{ij}| \le 1,$$

and the inequality is strict if variables $i$ and $j$ are not perfectly correlated.

1.4 Eigenvalues of a Correlation Matrix

Gershgorin's theorem states that the eigenvalues of $A \in \mathbb{C}^{n \times n}$ lie in the union of $n$ disks in the complex plane. The $i$th disk is given by

$$D_i = \Big\{ z \in \mathbb{C} : |z - a_{ii}| \le \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}| \Big\}, \qquad i = 1\colon n.$$

Since a correlation matrix, $A$, is symmetric, all its eigenvalues are real, so we have for an eigenvalue $\lambda$,

$$|\lambda - 1| \le n - 1.$$

But also $A$ is positive semidefinite, so its eigenvalues are nonnegative and we have finally

$$0 \le \lambda_i \le n, \qquad i = 1\colon n.$$

Furthermore, since $\operatorname{trace}(A) = \sum_i a_{ii} = n$,

$$\sum_{i=1}^{n} \lambda_i = n.$$
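These bounds are easy to check numerically. The following short MATLAB demonstration, which is ours rather than the thesis's, builds a sample correlation matrix from random data and confirms that its eigenvalues lie in $[0, n]$ and sum to $n$:

```matlab
% Quick numerical check (not from the thesis): eigenvalues of a sample
% correlation matrix lie in [0, n] and sum to n, up to round-off.
n = 5;
X = randn(100, n);       % 100 observations of n random variables
A = corrcoef(X);         % sample correlation matrix: unit diagonal, PSD
lambda = eig(A);
fprintf('min = %.2e, max = %.2e, sum = %.4f (n = %d)\n', ...
        min(lambda), max(lambda), sum(lambda), n);
```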

2 Calculation of Covariance and Correlation Matrices

2.1 Exact Sample Covariance and Correlation Matrices

There are several ways we can construct covariance and correlation matrices. Consider a matrix $P \in \mathbb{R}^{m \times n}$ where each column represents $m$ observations of a random variable and each row observations at a particular time. That is, $p_{ij}$ is the $i$th observation of the $j$th random variable. Let $S$ represent the sample covariance matrix, and $R$ the sample correlation matrix. Sample covariance, for the $i$th and $j$th random variable, is defined as

$$s_{ij} = \frac{1}{m-1}(p_i - \bar{p}_i)^T(p_j - \bar{p}_j), \qquad (2.1)$$

where the coefficient $(m-1)^{-1}$ is called the normalisation. Here, $p_i$ and $p_j$ represent the $i$th and $j$th columns of $P$, and $\bar{p}_k \in \mathbb{R}$ the sample mean of random variable $p_k$,

$$\bar{p}_k = \frac{1}{m}\sum_{i=1}^{m} p_{ik}.$$

The sample correlation coefficient is defined as

$$r_{ij} = \frac{(p_i - \bar{p}_i)^T(p_j - \bar{p}_j)}{\|p_i - \bar{p}_i\|_2\,\|p_j - \bar{p}_j\|_2}. \qquad (2.2)$$

From (2.1) we can write

$$s_{ij} = \frac{1}{m-1}\big[(p_{1i} - \bar{p}_i)(p_{1j} - \bar{p}_j) + (p_{2i} - \bar{p}_i)(p_{2j} - \bar{p}_j) + \cdots + (p_{mi} - \bar{p}_i)(p_{mj} - \bar{p}_j)\big],$$

and therefore

$$S = \frac{1}{m-1}\left(\begin{bmatrix} p_{11} - \bar{p}_1 \\ p_{12} - \bar{p}_2 \\ \vdots \\ p_{1n} - \bar{p}_n \end{bmatrix}\big[\,p_{11} - \bar{p}_1,\; p_{12} - \bar{p}_2,\; \ldots,\; p_{1n} - \bar{p}_n\,\big] + \cdots + \begin{bmatrix} p_{m1} - \bar{p}_1 \\ p_{m2} - \bar{p}_2 \\ \vdots \\ p_{mn} - \bar{p}_n \end{bmatrix}\big[\,p_{m1} - \bar{p}_1,\; p_{m2} - \bar{p}_2,\; \ldots,\; p_{mn} - \bar{p}_n\,\big]\right).$$

If we define the $i$th observation as $q_i = [p_{i1}, p_{i2}, \ldots, p_{in}]^T \in \mathbb{R}^n$ and $\bar{p} = [\bar{p}_1, \bar{p}_2, \ldots, \bar{p}_n]^T \in \mathbb{R}^n$ as the vector of sample means, we have

$$S = \frac{1}{m-1}\sum_{i=1}^{m}(q_i - \bar{p})(q_i - \bar{p})^T \in \mathbb{R}^{n \times n}. \qquad (2.3)$$

We can write (2.2) as

$$r_{ij} = \frac{(m-1)s_{ij}}{\sqrt{(m-1)s_{ii}}\sqrt{(m-1)s_{jj}}},$$

and defining

$$D_S^{1/2} = \operatorname{diag}(s_{11}^{1/2}, s_{22}^{1/2}, \ldots, s_{nn}^{1/2})$$

we have that the sample correlation matrix is

$$R = D_S^{-1/2} S D_S^{-1/2} \in \mathbb{R}^{n \times n}. \qquad (2.4)$$

We can write (2.3) as

$$S = \frac{1}{m-1}(P^T - \bar{p}e^T)(P^T - \bar{p}e^T)^T = \frac{1}{m-1}(P^T - \bar{p}e^T)(P - e\bar{p}^T),$$

where $e = [1, 1, \ldots, 1]^T \in \mathbb{R}^m$. Now, we can write

$$\bar{p} = \frac{1}{m}P^T e,$$

so

$$S = \frac{1}{m-1}\Big(P^T - \frac{1}{m}P^T ee^T\Big)\Big(P - \frac{1}{m}ee^T P\Big) = \frac{1}{m-1}P^T\Big(I - \frac{1}{m}ee^T\Big)\Big(I - \frac{1}{m}ee^T\Big)P,$$

where $I$ is the $m \times m$ identity matrix. Now, $I - \frac{1}{m}ee^T$ is idempotent, so

$$S = \frac{1}{m-1}P^T\Big(I - \frac{1}{m}ee^T\Big)P.$$

Now $\frac{1}{m}ee^T$ is rank 1 with nonzero eigenvalue 1, so $I - \frac{1}{m}ee^T$ has one zero eigenvalue, and the remainder are 1. Hence $S$ is positive semidefinite with rank at most the rank of $I - \frac{1}{m}ee^T$, which is $m - 1$ (and certainly at most $n$, as $S \in \mathbb{R}^{n \times n}$). For $S$ to be positive definite we clearly need $m > n$, that is, more observations than variables.

It is worth noting that the rank of $S$ and $R$ will be reduced if there is any linear dependence, either by two random variables being perfectly correlated or, more generally, if a column of $P$ can be written as a linear combination of other columns. Also, if one variable is actually a constant then it will have zero variance and all the covariances involving it will also be zero.

We define COV(P) and COR(P) to be the sample covariance and correlation matrices respectively, computed from the sample data matrix $P$, and refer to these as exact. (See Appendix A.1 for gencov.m, which computes COV(P), and gencor.m, which computes COR(P).)
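The M-files gencov.m and gencor.m of Appendix A.1 implement these computations; since they are not reproduced in this excerpt, the following is a minimal sketch of (2.1)-(2.4). The function name sample_cov_cor is ours, not the thesis's.

```matlab
% Minimal sketch of the exact computation (2.1)-(2.4); the thesis's own
% gencov.m and gencor.m in Appendix A.1 may differ in detail.
function [S, R] = sample_cov_cor(P)
%SAMPLE_COV_COR Exact sample covariance and correlation matrices.
%   P is m-by-n: m observations (rows) of n random variables (columns).
%   Assumes no column of P is constant, so that diag(S) > 0.
[m, n] = size(P);
e = ones(m, 1);
C = P - e * (sum(P, 1) / m);   % subtract column means: P - e*pbar^T
S = (C' * C) / (m - 1);        % equation (2.3)
d = sqrt(diag(S));             % the s_ii^{1/2}
R = S ./ (d * d');             % equation (2.4): D_S^{-1/2} S D_S^{-1/2}
end
```

Centring $P$ once and forming $C^TC/(m-1)$ is equivalent to (2.3) without forming the projector $I - \frac{1}{m}ee^T$ explicitly.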

2.2 Approximate Sample Covariance and Correlation Matrices

In the finance application not all elements of the sample data matrix $P$ are known. That is, at a given moment in time it is not possible to record the value of all the stocks. Thus we need a method to compute an approximate covariance and correlation matrix. One such method is a pairwise deletion method.

We represent the missing data matrix elements by NaNs. We use (2.1) to compute each element of the covariance matrix, but we use only the data that is available at common times for both variables. For example, if we have

$$p_i = \begin{bmatrix} p_{1i} \\ \mathrm{NaN} \\ p_{3i} \\ p_{4i} \\ p_{5i} \end{bmatrix}, \qquad p_j = \begin{bmatrix} p_{1j} \\ p_{2j} \\ p_{3j} \\ \mathrm{NaN} \\ p_{5j} \end{bmatrix},$$

then in the computation of $s_{ij}$ we use only those components for which data is available in both vectors. Thus

$$p_i = [p_{1i}, p_{3i}, p_{5i}]^T, \qquad p_j = [p_{1j}, p_{3j}, p_{5j}]^T,$$

and the normalisation of $m - 1$ is replaced with the effective sample size minus one, giving

$$s_{ij} = \frac{1}{2}\begin{bmatrix} p_{1i} - \bar{p}_i & p_{3i} - \bar{p}_i & p_{5i} - \bar{p}_i \end{bmatrix}\begin{bmatrix} p_{1j} - \bar{p}_j \\ p_{3j} - \bar{p}_j \\ p_{5j} - \bar{p}_j \end{bmatrix}.$$

It is obvious that nothing in this method will force $S$ to be positive semidefinite. We call this $S$ an approximate covariance matrix.

The approximate correlation matrix $R$ is calculated from (2.4). Note that calculating an approximate $R$ from (2.2) in a way analogous to the approximate $S$ above is not equivalent.

We define $\overline{\mathrm{COV}}(P)$ and $\overline{\mathrm{COR}}(P)$ to be the approximate sample covariance and correlation matrices respectively, computed from the data matrix $P$ with missing elements. (See Appendix A.1 for cov_bar.m, which computes $\overline{\mathrm{COV}}(P)$, and cor_bar.m, which computes $\overline{\mathrm{COR}}(P)$.)

Indefinite covariance and correlation matrices are a common problem. It has been reported on the MathWorks web site [16] that indefinite covariance matrices can be generated due to round-off error when a small number of observations are supplied to the functions cov and ewstats in MATLAB.

Users of Scientific Software International's statistical software LISREL have also found the same problems due to, among other reasons, pairwise deletion methods for dealing with incomplete data sets [13].

Finally, Financial Engineering Associates Inc.'s MakeVC [6] software claims to recognise the problem and make covariance matrices positive definite if they are not already.
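To make the pairwise deletion method concrete, the following is an illustrative MATLAB sketch. The thesis's own implementation is cov_bar.m in Appendix A.1 (not reproduced in this excerpt); the function name cov_pairwise and the detail that the means are taken over the common observations are our assumptions.

```matlab
% Illustrative sketch of pairwise deletion (the thesis's cov_bar.m in
% Appendix A.1 is the reference implementation; details here are assumed).
function S = cov_pairwise(P)
%COV_PAIRWISE Approximate sample covariance with pairwise deletion.
%   Missing entries of P are NaN. Each s_ij uses only the rows at which
%   both column i and column j are observed (assumed at least two rows).
n = size(P, 2);
S = zeros(n);
for i = 1:n
    for j = i:n
        k = ~isnan(P(:, i)) & ~isnan(P(:, j));   % common observations
        ci = P(k, i) - mean(P(k, i));            % centre over common rows
        cj = P(k, j) - mean(P(k, j));
        S(i, j) = (ci' * cj) / (sum(k) - 1);     % effective size minus one
        S(j, i) = S(i, j);                       % symmetry
    end
end
end
```

Nothing in this construction constrains the spectrum of $S$, so for suitably patterned missing data min(eig(cov_pairwise(P))) can be negative, which is precisely the indefiniteness problem described above.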

3 Testing

In this chapter we describe the test data we will use for our experiments.

3.1 Test Data

We use data that is the sale prices of the top 8 companies from the NASDAQ 100 on August 10th, 2001. The prices are for Aug 1st, 2001 and those at the first trading day of the previous nine months. See Table 3.1.

[Table 3.1: Sale Prices for 8 NASDAQ Companies — Biogen, PACCAR, Electronic Arts Inc., Amgen Inc., eBay Inc., Microsoft, Qualcomm Inc. and NVIDIA Corp., priced at the first trading day of each month (1 Nov 00, 1 Dec 00, 2 Jan 01, 1 Feb 01, 1 Mar 01, 2 Apr 01, 1 May 01, 1 Jun 01, 2 Jul 01, 1 Aug 01); the numeric price values are omitted.]

Some values are removed to simulate an incomplete data set, giving the following matrix, with a NaN representing the missing data:

[Equation (3.1): $P \in \mathbb{R}^{10 \times 8}$, the data matrix of Table 3.1 with selected entries replaced by NaN; the numeric values are omitted.]

3.2 Test Machine

All computation was undertaken using MATLAB 6 on a 350 MHz Pentium II running Linux.

4 The Nearest Correlation Matrix Problem

4.1 The Problem

We look at the problem of finding

$$\min\{\|A - X\|_F : X \text{ is a correlation matrix with certain elements fixed}\}, \qquad (4.1)$$

where $A = A^T \in \mathbb{R}^{n \times n}$. Our interest is in the case when $A = \overline{\mathrm{COR}}(P)$ is an approximate correlation matrix and has some exact entries. Here $\|A\|_F^2 = \sum_{i,j} a_{ij}^2$ defines the Frobenius norm.

We observe that all correlations involving a variable with missing entries will be approximate. From the computation of our approximate correlation matrix we can see that a missing element in $P$ will affect a whole row and column of $A$. That is, a missing element for the $i$th random variable will cause the $i$th row and the $i$th column to be approximate in the computed correlation matrix.

Since the order of variables in the correlation matrix is arbitrary we can permute any two rows and corresponding columns. So we can arrange our approximate correlation matrix, $A$, for the data matrix, $P$, containing $k$ columns of data with no missing entries as

$$A = \begin{bmatrix} E & B \\ B^T & C \end{bmatrix},$$

where $E = E^T \in \mathbb{R}^{k \times k}$ is the principal submatrix containing the exact correlations between the stocks $1\colon k$, $B \in \mathbb{R}^{k \times (n-k)}$ is approximate as it holds the correlations between stocks that have missing data and those which do not, and $C = C^T \in \mathbb{R}^{(n-k) \times (n-k)}$ contains the approximate correlations between the $n - k$ stocks that have data missing. Note that $E$ will be positive semidefinite as it is an exact correlation matrix.

Thus we seek a nearest correlation matrix $X$ to $A$ such that

$$x_{ij} = e_{ij}, \qquad 1 \le i, j \le k.$$

In [10] a solution is found to the problem

$$\min\{\|W^{1/2}(A - X)W^{1/2}\|_F : X \text{ is a correlation matrix}\}$$

by an alternating projections algorithm, where $W$ is a symmetric positive definite matrix of weights. We follow the same approach, and compare using the weighted method to try to preserve exact correlations with our direct solution of the problem.

Also we consider an approach of applying sequential quadratic programming, and ask whether a matrix completion method is suitable.

4.2 Alternating Projections

We define

$$\langle A, B \rangle = \operatorname{trace}(A^T B),$$

which is an inner product on $\mathbb{R}^{n \times n}$ that induces the Frobenius norm. We define also the sets

$$\mathcal{S} = \{Y = Y^T \in \mathbb{R}^{n \times n} : Y \succeq 0\},$$

$$\mathcal{U}_E = \left\{Y = Y^T = \begin{bmatrix} E & B \\ B^T & C \end{bmatrix} \in \mathbb{R}^{n \times n},\; B \in \mathbb{R}^{k \times (n-k)},\; C \in \mathbb{R}^{(n-k) \times (n-k)} : y_{ii} = 1\right\},$$

where $Y \succeq 0$ means that $Y$ is positive semidefinite.

We seek the matrix in the intersection of $\mathcal{S}$ and $\mathcal{U}_E$ which is closest to $A$ in the unweighted Frobenius norm, where $E$ is the exact part of $A$ as described above. Since our sets are both closed and convex, so is their intersection, so from [14, p. 69], for example, it follows that the minimum in (4.1) is achieved, and it is achieved at a unique matrix $X$.

Characterisation

The solution $X$ in (4.1) is characterised by the condition [14, p. 69]

$$\langle Z - X, A - X \rangle \le 0 \quad \text{for all } Z \in \mathcal{S} \cap \mathcal{U}_E. \qquad (4.2)$$

Here, $Z$ is of the form

$$Z = \begin{bmatrix} E & Z_1 \\ Z_1^T & Z_2 \end{bmatrix}. \qquad (4.3)$$

The normal cone of a convex set $K \subseteq \mathbb{R}^{n \times n}$ at $B \in K$ is

$$\partial K(B) = \{Y \in \mathbb{R}^{n \times n} : \langle Z - B, Y \rangle \le 0 \text{ for all } Z \in K\} = \{Y \in \mathbb{R}^{n \times n} : \langle Y, B \rangle = \sup_{Z \in K}\langle Y, Z \rangle\}. \qquad (4.4)$$

The condition (4.2) can be rewritten as $A - X \in \partial(\mathcal{S} \cap \mathcal{U}_E)(X)$, the normal cone to $\mathcal{S} \cap \mathcal{U}_E$ at $X$.

For two convex sets $K_1$ and $K_2$, $\partial(K_1 \cap K_2)(B) = \partial K_1(B) + \partial K_2(B)$ if the relative interiors of the two sets have a point in common [17, Cor. 23.8.1]. Thus we have

$$A - X \in \partial\mathcal{S}(X) + \partial\mathcal{U}_E(X), \qquad (4.5)$$

since any matrix of the form

$$\begin{bmatrix} E & B \\ B^T & I \end{bmatrix}$$

which is positive definite (where $I$ is the $(n-k) \times (n-k)$ identity matrix) is in the relative interiors of $\mathcal{S}$ and $\mathcal{U}_E$.

So we assume that $E$ is positive definite, which implies that we must have more observations than stocks with complete data sets, since, as we saw in Section 2.1, the rank of $E$ is at most $\min(m-1, k)$. Thus we only consider the case that $m \ge k + 1$ and $E$ is positive definite.

From [10] with $W = I$ we have

$$\partial\mathcal{S}(A) = \{Y = -VDV^T : V \in \mathbb{R}^{n \times p} \text{ has orthonormal columns spanning } \operatorname{null}(A),\; D = \operatorname{diag}(d_i) \ge 0\}. \qquad (4.6)$$

Lemma 4.1 For $A \in \mathcal{U}_E$,

$$\partial\mathcal{U}_E(A) = \left\{Y = Y^T = \begin{bmatrix} F & 0 \\ 0 & H \end{bmatrix} : F \in \mathbb{R}^{k \times k} \text{ arbitrary},\; H = \operatorname{diag}(h_i) \text{ arbitrary}\right\}.$$

Proof. Any $Z \in \mathcal{U}_E$ is of the form (4.3) with $\operatorname{diag}(Z_2) = I$. Let

$$Y = \begin{bmatrix} F & G \\ G^T & H \end{bmatrix} \in \partial\mathcal{U}_E(A).$$

If $G \ne 0$ or $H \ne \operatorname{diag}(h_i)$ we can choose $(Z_1)_{ij}$ or $(Z_2)_{ij}$, $i \ne j$, in (4.3) arbitrarily large and of the same sign as $G_{ij}$ and $H_{ij} \ne 0$, respectively, and violate the sup condition (4.4). Therefore $G = 0$ and $H = \operatorname{diag}(h_i)$, and any such $Y$ satisfies the sup condition.

Write

$$VDV^T = \begin{bmatrix} (VDV^T)_{11} & (VDV^T)_{12} \\ (VDV^T)_{21} & (VDV^T)_{22} \end{bmatrix},$$

where $(VDV^T)_{11} \in \mathbb{R}^{k \times k}$.

Theorem 4.1 The correlation matrix $X$ solves (4.1) if and only if

$$X = A + VDV^T + \begin{bmatrix} F & 0 \\ 0 & H \end{bmatrix},$$

where $V \in \mathbb{R}^{n \times p}$ has orthonormal columns spanning $\operatorname{null}(X)$, $D = \operatorname{diag}(d_i) \ge 0$, $F = -(VDV^T)_{11}$ and $H = \operatorname{diag}(h_i)$ is arbitrary.

Proof. The result follows from condition (4.5) on applying (4.6) and Lemma 4.1, and noting that $F$ is completely determined by the need to preserve $E$.

Now, if $a_{ii} \equiv 1$, which is true in the finance application, we also have the following theorem, which generalises [10, Thm. 2.5].

Theorem 4.2 If $A$ has diagonal elements $a_{ii} \equiv 1$ and $t$ nonpositive eigenvalues, then the nearest correlation matrix that preserves the exact part, $E$, has at least $t$ zero eigenvalues.

Proof. From Theorem 4.1 we have

$$X = A + VDV^T + \begin{bmatrix} F & 0 \\ 0 & H \end{bmatrix},$$

where $V$ has orthonormal columns spanning $\operatorname{null}(X)$ and $D = \operatorname{diag}(d_i) \ge 0$.
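Although the alternating projection M-files of Appendix A.2 are not part of this excerpt, a minimal sketch of the method, in the spirit of the algorithm of [10] with the exact block held fixed, is given below. The function name near_corr_fixed and the stopping test are our assumptions. A Dykstra-type correction is applied to the projection onto $\mathcal{S}$; the projection onto $\mathcal{U}_E$ needs no correction, since $\mathcal{U}_E$ is defined by linear equality constraints and is therefore an affine set.

```matlab
% Illustrative sketch of alternating projections with Dykstra's correction
% (assumed details; the thesis's Appendix A.2 M-files are the reference).
function X = near_corr_fixed(A, k, tol, maxit)
%NEAR_CORR_FIXED Nearest correlation matrix to A with A(1:k,1:k) fixed.
E = A(1:k, 1:k);                 % exact block to preserve
n = size(A, 1);
Y = A;
dS = zeros(n);                   % Dykstra correction for the S-projection
for it = 1:maxit
    R = Y - dS;
    X = proj_psd(R);             % projection onto S
    dS = X - R;
    Yold = Y;
    Y = X;                       % projection onto U_E: restore the fixed
    Y(1:k, 1:k) = E;             % block E and the unit diagonal
    Y(1:n+1:end) = 1;
    if norm(Y - Yold, 'fro') <= tol * norm(Y, 'fro'), break, end
end
X = Y;
end

function X = proj_psd(A)
% Projection onto the symmetric positive semidefinite matrices: zero out
% the negative eigenvalues in a spectral decomposition.
[V, L] = eig((A + A') / 2);
X = V * max(L, 0) * V';
X = (X + X') / 2;                % re-symmetrise against round-off
end
```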
