Chapter 1: Basic Concepts for Multivariate Statistics


Contents
1.1 Introduction
1.2 Population Versus Sample
1.3 Elementary Tools for Understanding Multivariate Data
1.4 Data Reduction, Description, and Estimation
1.5 Concepts from Matrix Algebra
1.6 Multivariate Normal Distribution
1.7 Concluding Remarks

1.1 Introduction

Data are information. Most crucial scientific, sociological, political, economic, and business decisions are made based on data analysis. Often data are available in abundance, but by themselves they are of little help unless they are summarized and an appropriate interpretation of the summary quantities is made. However, such a summary and corresponding interpretation can rarely be made just by looking at the raw data. A careful scientific scrutiny and analysis of these data can usually provide an enormous amount of valuable information. Often such an analysis may not be obtained just by computing simple averages. Admittedly, the more complex the data and their structure, the more involved the data analysis.

The complexity in a data set may exist for a variety of reasons. For example, the data set may contain too many observations that stand out and whose presence in the data cannot be justified by any simple explanation. Such observations are often viewed as influential observations or outliers. Deciding which observation is or is not an influential one is a difficult problem. For a brief review of some graphical and formal approaches to this problem, see Khattree and Naik (1999). A good, detailed discussion of these topics can be found in Belsley, Kuh, and Welsch (1980), Belsley (1991), Cook and Weisberg (1982), and Chatterjee and Hadi (1988).

Another situation in which a simple analysis based on averages alone may not suffice occurs when the data on some of the variables are correlated or when there is a trend present in the data. Such a situation often arises when data were collected over time.
For example, when the data are collected on a single patient or a group of patients under a given treatment, we are rarely interested in knowing the average response over time. What we are interested in is observing any changes in the values, that is, in observing any patterns or trends.

Many times, data are collected on a number of units, and on each unit not just one, but many variables are measured. For example, in a psychological experiment, many tests are used, and each individual is subjected to all these tests. Since these are measurements on the same unit (an individual), these measurements (or variables) are correlated and, while summarizing the data on all these variables, this set of correlations (or some equivalent quantity) should be an integral part of the summary. Further, when many variables exist, in

order to obtain more definite and more easily comprehensible information, this correlation summary (and its structure) should be subjected to further analysis. There are many other possible ways in which a data set can be quite complex for analysis.

However, it is the last situation that is of interest to us in this book. Specifically, we may have n individual units, and on each unit we have observed the (same) p different characteristics (variables), say x_1, x_2, ..., x_p. Then these data can be presented as an n by p matrix

$$
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}.
$$

Of course, the measurements in the i-th row, namely x_{i1}, ..., x_{ip}, which are the measurements on the same unit, are correlated. If we arrange them in a column vector x_i defined as

$$
\mathbf{x}_i = \begin{pmatrix} x_{i1} \\ \vdots \\ x_{ip} \end{pmatrix},
$$

then x_i can be viewed as a multivariate observation. Thus, the n rows of matrix X correspond to n multivariate observations (written as rows within this matrix), and the measurements within each x_i are usually correlated. There may or may not be a correlation between the columns x_1, ..., x_n. Usually, x_1, ..., x_n are assumed to be uncorrelated (or statistically independent, as a stronger assumption), but this may not always be so. For example, if x_i, i = 1, ..., n, contains measurements on the height and weight of the i-th brother in a family with n brothers, then it is reasonable to assume that some kind of correlation may exist between the rows of X as well.

For much of what is considered in this book, we will not concern ourselves with the scenario in which rows of the data matrix X are also correlated.
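The book's own examples use SAS; purely as an illustrative sketch (the data values below are invented), the n by p layout and its rows-as-observations reading look like this in NumPy:

```python
import numpy as np

# Hypothetical measurements: n = 4 units, p = 3 variables per unit.
# Row i of X is the multivariate observation x_i' (say height, weight, age).
X = np.array([[172.0, 68.5, 34.0],
              [165.0, 59.0, 28.0],
              [180.0, 77.2, 41.0],
              [158.0, 52.3, 25.0]])

n, p = X.shape     # n = 4 units, p = 3 variables
x2 = X[1]          # the second multivariate observation, a p-vector
print(n, p, x2)
```

Each row is treated as one correlated p-tuple of measurements; the rows themselves are what we will usually assume to be independent.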
In other words, when the rows of X constitute a sample, such a sample will be assumed to be statistically independent. However, before we elaborate on this, we should briefly comment on sampling issues.

1.2 Population Versus Sample

As we pointed out, the rows in the n by p data matrix X are viewed as multivariate observations on n units. If the set of these n units constitutes the entire (finite) set of all possible units, then we have data available on the entire reference population. An example of such a situation is the data collected on all cities in the United States that have a population of 1,000,000 or more, and on three variables, namely, cost of living, average annual salary, and the quality of health care facilities. Since each U.S. city that qualifies for the definition is included, any summary of these data will be the true summary of the population.

However, more often than not, the data are obtained through a survey in which, on each of the units, all p characteristics are measured. Such a situation represents a multivariate sample. A sample (adequately or poorly) represents the underlying population from which it is taken. As the population is now represented through only a few units taken from it, any summary derived from the sample merely represents the true population summary, in the sense that we hope it will generally be close to the true summary, although no assurance about an exact match between the two can be given.

How can we measure and ensure that the summary from a sample is a good representative of the population summary? To quantify it, some kinds of indexes based on probabilistic ideas seem appropriate. That requires one to build some kind of probabilistic structure over these units. This is done by artificially and intentionally introducing the probabilistic structure into the sampling scheme. Of course, since we want to ensure that the sample is a good representative of the population, the probabilistic structure should be such that it treats all the population units in an equally fair way. Thus, we require that the sampling is done in such a way that each unit of the (finite or infinite) population has an equal chance of being included in the sample. This requirement can be met by simple random sampling with or without replacement. It may be pointed out that in the case of a finite population and sampling without replacement, observations are not independent, although the strength of dependence diminishes as the sample size increases.

Although a probabilistic structure is introduced over different units through random sampling, the same cannot be done for the p different measurements, as there is neither a reference population nor do all p measurements (such as weight, height, etc.) necessarily represent the same thing. However, there is possibly some inherent dependence between these measurements, and this dependence is often assumed and modeled as some joint probability distribution. Thus, we view each row of X as a multivariate observation from some p-dimensional population that is represented by some p-dimensional multivariate distribution. Thus, the rows of X often represent a random sample from a p-dimensional population. In much multivariate analysis work, this population is assumed to be infinite, and quite frequently it is assumed to have a multivariate normal distribution.
We will briefly discuss the multivariate normal distribution and its properties in Section 1.6.

1.3 Elementary Tools for Understanding Multivariate Data

To understand a large data set on several mutually dependent variables, we must somehow summarize it. For univariate data, when there is only one variable under consideration, the data are usually summarized by the (population or sample) mean, variance, skewness, and kurtosis. These are the basic quantities used for data description. For multivariate data, their counterparts are defined in a similar way. However, the description is greatly simplified if matrix notations are used. Some of the matrix terminology used here is defined later in Section 1.5.

Let x be the p by 1 random vector corresponding to the multivariate population under consideration. If we let

$$
\mathbf{x} = \begin{pmatrix} x_1 \\ \vdots \\ x_p \end{pmatrix},
$$

then each x_i is a random variable, and we assume that x_1, ..., x_p are possibly dependent. With E(·) representing the mathematical expectation (interpreted as the long-run average), let μ_i = E(x_i), and let σ_ii = var(x_i) be the population variance. Further, let the population covariance between x_i and x_j be σ_ij = cov(x_i, x_j). Then we define the population mean vector E(x) as the vector of term-by-term expectations. That is,

$$
E(\mathbf{x}) = \begin{pmatrix} E(x_1) \\ \vdots \\ E(x_p) \end{pmatrix}
             = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_p \end{pmatrix}
             = \boldsymbol{\mu} \text{ (say)}.
$$

Additionally, the concept of population variance is generalized to the matrix with all the population variances and covariances placed appropriately within a variance-covariance matrix. Specifically, if we denote the variance-covariance matrix of x by D(x), then

$$
D(\mathbf{x}) = \begin{pmatrix}
\operatorname{var}(x_1)      & \operatorname{cov}(x_1, x_2) & \cdots & \operatorname{cov}(x_1, x_p) \\
\operatorname{cov}(x_2, x_1) & \operatorname{var}(x_2)      & \cdots & \operatorname{cov}(x_2, x_p) \\
\vdots                       &                              &        & \vdots \\
\operatorname{cov}(x_p, x_1) & \operatorname{cov}(x_p, x_2) & \cdots & \operatorname{var}(x_p)
\end{pmatrix}
= \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots      &             &        & \vdots \\
\sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp}
\end{pmatrix}
= (\sigma_{ij}) = \boldsymbol{\Sigma} \text{ (say)}.
$$

That is, with the understanding that cov(x_i, x_i) = var(x_i) = σ_ii, the term cov(x_i, x_j) appears as the (i, j)th entry in matrix Σ. Thus, the variance of the i-th variable appears at the i-th diagonal place, and all covariances are appropriately placed at the nondiagonal places. Since cov(x_i, x_j) = cov(x_j, x_i), we have σ_ij = σ_ji for all i, j. Thus, the matrix Σ = D(x) is symmetric. The other alternative notations for D(x) are cov(x) and var(x), and it is often also referred to as the dispersion matrix, the variance-covariance matrix, or simply the covariance matrix. We will use the three terms interchangeably.

The quantity tr(Σ) (read as the trace of Σ), tr(Σ) = Σ_{i=1}^{p} σ_ii, is called the total variance, and |Σ| (the determinant of Σ) is referred to as the generalized variance. The two are often taken as overall measures of variability of the random vector x. However, sometimes their use can be misleading. Specifically, the total variance tr(Σ) completely ignores the nondiagonal terms of Σ that represent the covariances. At the same time, two very different matrices may yield the same value of the generalized variance.

As there exists dependence between x_1, ..., x_p, it is also meaningful to at least measure the degree of linear dependence. It is often measured using the correlations. Specifically, let

$$
\rho_{ij} = \frac{\operatorname{cov}(x_i, x_j)}{\sqrt{\operatorname{var}(x_i)\operatorname{var}(x_j)}}
          = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\,\sigma_{jj}}}
$$

be Pearson's population correlation coefficient between x_i and x_j. Then we define the population correlation matrix as

$$
\boldsymbol{\rho} = (\rho_{ij}) = \begin{pmatrix}
\rho_{11} & \rho_{12} & \cdots & \rho_{1p} \\
\rho_{21} & \rho_{22} & \cdots & \rho_{2p} \\
\vdots    &           &        & \vdots \\
\rho_{p1} & \rho_{p2} & \cdots & \rho_{pp}
\end{pmatrix}
= \begin{pmatrix}
1         & \rho_{12} & \cdots & \rho_{1p} \\
\rho_{21} & 1         & \cdots & \rho_{2p} \\
\vdots    &           &        & \vdots \\
\rho_{p1} & \rho_{p2} & \cdots & 1
\end{pmatrix}.
$$

As was the case for Σ, ρ is also symmetric.
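The caveat about the generalized variance can be checked numerically. As a hedged sketch (the two covariance matrices below are invented for illustration, and NumPy stands in for the book's SAS tooling), here are two quite different matrices with the same |Σ|, along with the correlation matrix obtained from one of them:

```python
import numpy as np

# Two hypothetical 2x2 covariance matrices chosen to share |Sigma| = 4
# despite very different structure.
Sigma1 = np.array([[2.0, 0.0],
                   [0.0, 2.0]])   # uncorrelated, equal variances
Sigma2 = np.array([[5.0, 1.0],
                   [1.0, 1.0]])   # correlated, unequal variances

total1, total2 = np.trace(Sigma1), np.trace(Sigma2)        # total variance tr(Sigma)
gen1, gen2 = np.linalg.det(Sigma1), np.linalg.det(Sigma2)  # generalized variance |Sigma|

# Correlation matrix of Sigma2: rho_ij = sigma_ij / sqrt(sigma_ii * sigma_jj).
d = np.diag(1.0 / np.sqrt(np.diag(Sigma2)))
rho = d @ Sigma2 @ d

print(total1, total2)   # total variances differ: 4.0 vs 6.0
print(gen1, gen2)       # generalized variances agree (both 4, up to rounding)
print(rho)
```

The equal determinants illustrate the warning in the text: the generalized variance alone cannot distinguish these two very different dependence structures.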
Further, ρ can be expressed in terms of Σ as

$$
\boldsymbol{\rho} = [\operatorname{diag}(\boldsymbol{\Sigma})]^{-\frac{1}{2}} \,\boldsymbol{\Sigma}\, [\operatorname{diag}(\boldsymbol{\Sigma})]^{-\frac{1}{2}},
$$

where diag(Σ) is the diagonal matrix obtained by retaining the diagonal elements of Σ and replacing all the nondiagonal elements by zero. Further, the square root of a matrix A, denoted by A^{1/2}, is a matrix satisfying A = A^{1/2} A^{1/2}; it is defined in Section 1.5. Also, A^{-1/2} represents the inverse of A^{1/2}.

It may be mentioned that the variance-covariance and the correlation matrices are always nonnegative definite (see Section 1.5 for a discussion). For most of the discussion in this book, these matrices, however, will be assumed to be positive definite. In view of this assumption, these matrices will also admit their respective inverses.

How do we generalize (and measure) the skewness and kurtosis for a multivariate population? Mardia (1970) defines these measures as

$$
\text{multivariate skewness:}\quad \beta_{1,p} = E\!\left[(\mathbf{x} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu})\right]^3,
$$

where x and y are independent but have the same distribution, and

$$
\text{multivariate kurtosis:}\quad \beta_{2,p} = E\!\left[(\mathbf{x} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]^2.
$$

For the univariate case, that is, when p = 1, β_{1,p} reduces to the square of the coefficient of skewness, and β_{2,p} reduces to the coefficient of kurtosis.

The quantities μ, Σ, ρ, β_{1,p}, and β_{2,p} provide a basic summary of a multivariate population. What about the sample counterparts of these quantities? When we have a p-variate random sample x_1, ..., x_n of size n, then with the n by p data matrix X defined as

$$
X_{n \times p} = \begin{pmatrix} \mathbf{x}_1' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix},
$$

we define

$$
\text{sample mean vector:}\quad \bar{\mathbf{x}} = n^{-1} \sum_{i=1}^{n} \mathbf{x}_i = n^{-1} X' \mathbf{1}_n,
$$

$$
\begin{aligned}
\text{sample variance-covariance matrix:}\quad S &= (n-1)^{-1} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})' \\
&= (n-1)^{-1} \left( \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i' - n \bar{\mathbf{x}} \bar{\mathbf{x}}' \right) \\
&= (n-1)^{-1} X' (I_n - n^{-1} \mathbf{1}_n \mathbf{1}_n') X \\
&= (n-1)^{-1} (X'X - n^{-1} X' \mathbf{1}_n \mathbf{1}_n' X) \\
&= (n-1)^{-1} (X'X - n \bar{\mathbf{x}} \bar{\mathbf{x}}').
\end{aligned}
$$

It may be mentioned that often, instead of the dividing factor (n − 1) in the above expressions, a dividing factor of n is used. Such a sample variance-covariance matrix is denoted by S_n. We also have

$$
\text{sample correlation matrix:}\quad \hat{\boldsymbol{\rho}} = [\operatorname{diag}(S)]^{-\frac{1}{2}} \, S \, [\operatorname{diag}(S)]^{-\frac{1}{2}} = [\operatorname{diag}(S_n)]^{-\frac{1}{2}} \, S_n \, [\operatorname{diag}(S_n)]^{-\frac{1}{2}},
$$

$$
\text{sample multivariate skewness:}\quad \hat{\beta}_{1,p} = n^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n} g_{ij}^3,
$$

and

$$
\text{sample multivariate kurtosis:}\quad \hat{\beta}_{2,p} = n^{-1} \sum_{i=1}^{n} g_{ii}^2.
$$

In the above expressions, 1_n denotes an n by 1 vector with all entries 1, I_n is the n by n identity matrix, and g_{ij}, i, j = 1, ..., n, are defined by g_{ij} = (x_i − x̄)' S_n^{-1} (x_j − x̄). See Khattree and Naik (1999) for details and computational schemes to compute these quantities. In fact, multivariate skewness and multivariate kurtosis are computed later in Chapter 5, Section 5.2, to test the multivariate normality assumption on data. Correlation matrices also play a central role in principal components analysis (Chapter 2, Section 2.2).
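The book carries out these computations with SAS in later chapters; purely as an illustrative transcription of the formulas above into NumPy (the small data matrix is invented, not from the book), the sample summaries can be sketched as:

```python
import numpy as np

# Invented data: n = 5 observations on p = 2 variables.
X = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, 5.9],
              [4.0, 8.2],
              [5.0, 9.8]])
n, p = X.shape

xbar = X.mean(axis=0)            # sample mean vector, n^{-1} X' 1_n
Xc = X - xbar                    # centered data
S = Xc.T @ Xc / (n - 1)          # sample covariance matrix, divisor n-1
Sn = Xc.T @ Xc / n               # S_n, divisor n

# Sample correlation matrix: [diag(S)]^(-1/2) S [diag(S)]^(-1/2)
d = np.diag(1.0 / np.sqrt(np.diag(S)))
rho_hat = d @ S @ d

# Mardia's measures via g_ij = (x_i - xbar)' Sn^{-1} (x_j - xbar)
G = Xc @ np.linalg.inv(Sn) @ Xc.T
b1p = (G ** 3).sum() / n**2      # sample multivariate skewness
b2p = (np.diag(G) ** 2).sum() / n  # sample multivariate kurtosis
print(xbar, b1p, b2p)
```

A useful check on the g_ij matrix is that its trace is always n·p, since tr(Sn^{-1} Xc' Xc) = tr(Sn^{-1} · n Sn) = np.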

1.4 Data Reduction, Description, and Estimation

In the previous section, we presented some of the basic population summary quantities and their sample counterparts, commonly referred to as descriptive statistics. The basic idea was to summarize the population or sample data through smaller sized matrices or simply numbers. All the quantities (except correlation) defined there were straightforward generalizations of their univariate counterparts. However, multivariate data do have some of their own unique features and needs, which do not exist in the univariate situation. Even though the idea is still the same, namely that of summarizing or describing the data, such situations call for certain unique ways of handling them, and these unique techniques form the main theme of this book. These can best be described by a few examples.

a. Based on a number of measurements such as average housing prices, cost of living, health care facilities, crime rate, etc., we would like to describe which cities in the country are most livable and also try to observe any unique similarities or differences among cities. There are several variables to be measured, and it is unlikely that attempts to order cities with respect to any one variable will result in the same ordering if another variable were used. For example, a city with a low crime rate (a desirable feature) may have a high cost of living (an undesirable feature), and thus these variables often tend to offset each other. How do we decide which cities are the best to live in? The problem here is that of data reduction. However, this problem can neither be described as that of variable selection (there is no dependent variable and no model) nor can it be viewed as a prediction problem. It is more a problem of attempting to detect and understand the unique features that the data set may contain and then to interpret them. This requires some meaningful approach for data description.
The possible analyses for such a data set are principal component analysis (Chapter 2) and cluster analysis (Chapter 6).

b. As another example, suppose we have a set of independent variables which in turn have effects on a large number of dependent variables. Such a situation is quite common in the chemical industry and in economic data, where the two sets can be clearly defined as those containing input and output variables. We are not interested in individual variables; rather, we want to come up with a few new variables in each group. These may themselves be functions of all variables in the respective groups, so that each new variable from one group can be paired with another new variable in the other group in some meaningful sense, with the hope that these newly defined variables can be appropriately interpreted in context. We must emphasize that the analysis is not being done with any specific purpose of proving or disproving some claims. It is only an attempt to understand the data. As the information is presented in terms of new variables, which are fewer in number, it is easier to observe any striking features or associations in this latter situation. Such problems can be handled using the techniques of canonical correlation (Chapter 3) and, in the case of qualitative data, using correspondence analysis (Chapter 7).

c. An automobile company wants to know what determines the customer's preference for various cars. A sample of 100 randomly selected individuals was asked to give a score between 1 (low) and 10 (high) on six variables, namely, price, reliability, status symbol related to the car, gas mileage, safety in an accident, and average miles driven per week. What kind of analysis can be made for these data? With the assumption that there are some underlying hypothetical and unobservable variables on which the scores of these six observable variables depend, a natural inquiry would be to identify these hypothetical variables.
Intuitively, safety consciousness and the economic status of the individual may be two (perhaps of several) traits that influence the scores on some of these six observable variables. Thus, some or all of the observed variables can be written as functions of, say, these two unobservable traits. A question in reverse is this: can

we quantify the unobservable traits as functions of the observable ones? Such a query can usually be answered by factor analysis techniques (Chapter 4). Note, however, that the analysis provides only the functions; their interpretation as some meaningful unobservable trait is left to the analyst. Nonetheless, it is again a problem of data reduction and description, in that many measurements are reduced to only a few traits with the objective of providing an appropriate description of the data.

As is clear from these examples, many multivariate problems involve data reduction, description, and, in the process of doing so, estimation. These issues form the focus of the next six chapters. As a general theme, most of the situations either require some matrix decomposition and transformations or use a distance-based approach. Distributional assumptions such as multivariate normality are also helpful (usually, but not always, in assessing the quality of estimation) but not crucial. With that in mind, in the next section we provide a brief review of some important concepts from matrix theory. A review of multivariate normality is presented in Section 1.6.

1.5 Concepts from Matrix Algebra

This section is meant only as a brief review of concepts from matrix algebra. An excellent account of results on matrices with a statistical viewpoint can be found in the recent books by Schott (1996), Harville (1997), and Rao and Rao (1998). We will assume that the reader is already familiar with the addition,

