Multivariate Statistics With R - UniFI


Multivariate Statistics with R

Paul J. Hewson

March 17, 2009


Contents

1 Multivariate data
  1.1 The nature of multivariate data
  1.2 The role of multivariate investigations
  1.3 Summarising multivariate data (presenting data as a matrix, mean vectors, covariance matrices)
    1.3.1 Data display
  1.4 Graphical and dynamic graphical methods
    1.4.1 Chernoff’s Faces
    1.4.2 Scatterplots, pairwise scatterplots (draftsman plots)
    1.4.3 Optional: 3d scatterplots
    1.4.4 Other methods
  1.5 Animated exploration

2 Matrix manipulation
  2.1 Vectors
    2.1.1 Vector multiplication; the inner product
    2.1.2 Outer product
    2.1.3 Vector length

    2.1.4 Orthogonality
    2.1.5 Cauchy-Schwartz Inequality
    2.1.6 Angle between vectors
  2.2 Matrices
    2.2.1 Transposing matrices
    2.2.2 Some special matrices
    2.2.3 Equality and addition
    2.2.4 Multiplication
  2.3 Crossproduct matrix
    2.3.1 Powers of matrices
    2.3.2 Determinants
    2.3.3 Rank of a matrix
  2.4 Matrix inversion
  2.5 Eigen values and eigen vectors
  2.6 Singular Value Decomposition
  2.7 Extended Cauchy-Schwarz Inequality
  2.8 Partitioning
  2.9 Exercises

3 Measures of distance
  3.1 Mahalanobis Distance
    3.1.1 Distributional properties of the Mahalanobis distance
  3.2 Definitions
  3.3 Distance between points

    3.3.1 Quantitative variables - Interval scaled
    3.3.2 Distance between variables
    3.3.3 Quantitative variables: Ratio Scaled
    3.3.4 Dichotomous data
    3.3.5 Qualitative variables
    3.3.6 Different variable types
  3.4 Properties of proximity matrices

4 Cluster analysis
  4.1 Introduction to agglomerative hierarchical cluster analysis
    4.1.1 Nearest neighbour / Single Linkage
    4.1.2 Furthest neighbour / Complete linkage
    4.1.3 Group average link
    4.1.4 Alternative methods for hierarchical cluster analysis
    4.1.5 Problems with hierarchical cluster analysis
    4.1.6 Hierarchical clustering in R
  4.2 Cophenetic Correlation
  4.3 Divisive hierarchical clustering
  4.4 K-means clustering
    4.4.1 Partitioning around medoids
    4.4.2 Hybrid Algorithms
  4.5 K-centroids
  4.6 Further information

5 Multidimensional scaling
  5.1 Metric Scaling
    5.1.1 Similarities with principal components analysis
  5.2 Visualising multivariate distance
  5.3 Assessing the quality of fit
    5.3.1 Sammon Mapping

6 Multivariate normality
  6.1 Expectations and moments of continuous random functions
  6.3 Multivariate normality
    6.5.1 R estimation
  6.6 Transformations

7 Inference for the mean
  7.1 Two sample Hotelling’s T2 test
  7.2 Constant Density Ellipses
  7.3 Multivariate Analysis of Variance

8 Discriminant analysis
  8.1 Fisher discrimination
  8.2 Accuracy of discrimination
  8.3 Importance of variables in discrimination
  8.4 Canonical discriminant functions
  8.5 Linear discrimination - a worked example

9 Principal component analysis

  9.1 Derivation of Principal Components
    9.1.1 A little geometry
    9.1.2 Principal Component Stability
  9.2 Some properties of principal components
  9.8 Illustration of Principal Components
    9.8.1 An illustration with the Sydney Heptathlon data
    9.8.2 Principal component scoring
    9.8.3 Prepackaged PCA function 1: princomp()
    9.8.4 Inbuilt functions 2: prcomp()
  9.9 Principal Components Regression
  9.10 “Model” criticism for principal components analysis
    9.10.1 Distribution theory for the Eigenvalues and Eigenvectors of a covariance matrix
  9.13 Sphericity
    9.15.1 Partial sphericity
  9.22 How many components to retain
    9.22.1 Data analytic diagnostics
    9.23.1 Cross validation
    9.23.2 Forward search
    9.23.3 Assessing multivariate normality
  9.25 Interpreting the principal components
  9.27 Exercises

10 Canonical Correlation
  10.1 Canonical variates

  10.2 Interpretation
  10.3 Computer example
    10.3.1 Interpreting the canonical variables
    10.3.2 Hypothesis testing

11 Factor analysis
  11.1 Role of factor analysis
  11.2 The factor analysis model
    11.2.1 Centred and standardised data
    11.2.2 Factor indeterminacy
    11.2.3 Strategy for factor analysis
  11.3 Principal component extraction
    11.3.1 Diagnostics for the factor model
    11.3.2 Principal Factor solution
  11.4 Maximum likelihood solutions
  11.5 Rotation
  11.6 Factor scoring

Bibliography

Books

Many of the statistical analyses encountered to date consist of a single response variable and one or more explanatory variables. In this latter case, multiple regression, we regressed a single response (dependent) variable on a number of explanatory (independent) variables. This is occasionally referred to as “multivariate regression”, which is all rather unfortunate. There isn’t an entirely clear “canon” of what is a multivariate technique and what isn’t (one could argue that discriminant analysis involves a single dependent variable). However, we are going to consider the simultaneous analysis of a number of related variables. We may approach this in one of two ways. The first group of problems relates to classification, where attention is focussed on individuals who are more alike. In unsupervised classification (cluster analysis) we are concerned with a range of algorithms that at least try to identify individuals who are more alike, if not to distinguish clear groups of individuals. There is also a wide range of scaling techniques which help us visualise these differences in lower dimensionality. In supervised classification (discriminant analysis) we already have information on group membership, and wish to develop rules from the data to classify future observations.

The other group of problems concerns inter-relationships between variables. Again, we may be interested in lower-dimensional representations that help us visualise a given dataset. Alternatively, we may be interested to see how one group of variables is correlated with another group of variables. Finally, we may be interested in models for the interrelationships between variables.

This book is still a work in progress. Currently it contains material used as notes to support a module at the University of Plymouth, where we work in conjunction with Johnson and Wichern (1998). It covers a reasonably established range of multivariate techniques. There isn’t however a clear “canon” of multivariate techniques, and some of the following books may also be of interest.

Other introductory level books:

- Afifi and Clark (1990)
- Chatfield and Collins (1980)
- Dillon and Goldstein (1984)
- Everitt and Dunn (1991)

- Flury and Riedwyl (1988)
- Johnson (1998)
- Kendall (1975)
- Hair et al. (1995), et al. (1998)
- Manly (1994)

Intermediate level books:

- Flury (1997) (My personal favourite)
- Gnanadesikan (1997)
- Harris (1985)
- Krzanowski (2000)
- ?Krzanowski and Marriott (1994b)
- Rencher (2002)
- Morrison (2005)
- Seber (1984)
- Timm (1975)

More advanced books:

- Anderson (1984)
- Bilodeau and Brenner (1999)
- Giri (2003)
- Mardia et al. (1979)
- Muirhead (York)
- Press (1982)
- Srivastava and Carter (1983)

Some authors include contingency tables and log-linear modelling, others exclude cluster analysis. Given that multivariate methods are particularly common in applied areas such as Ecology and Psychology, there is further reading aimed at these subjects. It is quite possible that they will have very readable descriptions of particular techniques.

Whilst this book is still an alpha-version work in progress, the aim is:

(a) To cover a basic core of multivariate material in such a way that the core mathematical principles are covered

(b) To provide access to current applications and developments

There is little material included yet for (b) (although sketch notes are being worked on). Comments, feedback, corrections, co-authors are all welcome.


Chapter 1

Multivariate data

1.1 The nature of multivariate data

We will attempt to clarify what we mean by multivariate analysis in the next section; however, it is worth noting that much of the data examined is observational rather than collected from designed experiments. It is also apparent that much of the methodology has been developed outside the statistical literature. Our primary interest will be in examining continuous data, the only exception being categorical variables indicating group membership. This may be slightly limiting, but we will also tend to rely on at least asymptotic approximations to (multivariate) normality, although these are not always necessary for some techniques. The multivariate normal distribution is a fascinating subject in its own right, and experience (supplemented with some brutal transformations) indicates it is a reasonable basis for much work. Nevertheless, there is considerable interest in robust methods at the moment and we refer to some of these approaches where possible.

1.2 The role of multivariate investigations

If we assume that linear and generalised linear models (and their descendants) are the mainstay of statistical practice, there is a sense in which most statistical analysis is multivariate. However, multivariate analysis has come to define a canon of methods which could be characterised by their use of the dependence structure between a large number of variables. This canon has not yet been firmly established; we attempt one definition of it here, but omit some methods others would include and include some methods others would omit. We would suggest that multivariate analysis has either the units as a primary focus, or involves an assessment primarily of the variables. When considering the units, we usually refer to techniques for classification: supervised classification if we already understand the grouping, and unsupervised classification where we have no a priori knowledge of any groupings within our observed units. The multivariate methodology at the core of supervised classification is discriminant analysis, although the machine learning community has developed many other approaches to the same task. We will consider these techniques in the light of hypothesis tests (Hotelling’s T2 test and Multivariate Analysis of Variance) which might help us determine whether groupings within our data really are distinct. Unsupervised classification has traditionally been associated with cluster analysis, a wide range of algorithms which attempt to find structure in data. It is perhaps cluster analysis that is the most often contested component of our multivariate canon: some authorities prefer approaches based less on automated algorithms and rather more on statistical models, and would argue for approaches such as mixture models and perhaps latent class analysis. Given the reliance of cluster analysis on distance measures, we will also consider scaling techniques as a method of visualising distance.

In considering the relationship between variables, we will spend some time exploring principal components, the most misused of all multivariate techniques, which we consider primarily as a projection technique. Some coverage will also be given to canonical correlation, an attempt to understand the relationship between two sets of variables. Finally, we will consider factor analysis, a much contested technique in statistical circles but a much used one in applied settings.

In order to make some sense of these techniques, we will present a brief overview of linear algebra as it pertains to the techniques we wish to explore, and will present some properties of the multivariate normal distribution.

1.3 Summarising multivariate data (presenting data as a matrix, mean vectors, covariance matrices)

A number of datasets will be used throughout the course; where these are not available within R itself they will be posted in the student portal. For now, consider the USArrests data. This was published by McNeil, D. R. (1977), “Interactive Data Analysis”, Wiley, and gives arrest rates in 1973 (derived from the World Almanac and Book of Facts 1975) and urban population rates (derived from the Statistical Abstracts of the United States 1975). We therefore consider data on “Murder” (arrests per 100,000), Assault (arrests per 100,000), Rape (arrests per 100,000) and the percentage of the population living in urban areas in each state.

1.3.1 Data display

A matrix is a convenient way of arranging such data.
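This arrangement, together with the summaries named in the heading of Section 1.3 (the mean vector and the covariance matrix), can be produced with a handful of base R functions. The sketch below is illustrative rather than taken from the course materials, and assumes nothing beyond the datasets that ship with R.

## Minimal sketch using only base R: arrange the USArrests data as a
## matrix and compute the summaries named in the section heading.
data(USArrests)            # 50 states, 4 variables
X <- as.matrix(USArrests)  # n x p data matrix (n = 50, p = 4)

head(X, 11)    # first 11 rows, matching the display below
colMeans(X)    # sample mean vector (length 4)
cov(X)         # 4 x 4 sample covariance matrix
cor(X)         # 4 x 4 sample correlation matrix

The head() call corresponds to the display that follows; colMeans(), cov() and cor() use all 50 states.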

State         Murder   Assault   UrbanPop (%)
Alabama         13.2       236             58
Alaska          10.0       263             48
Arizona          8.1       294             80
Arkansas         8.8       190             50
California       9.0       276             91
Colorado         7.9       204             78
Connecticut      3.3       110             77
Delaware         5.9       238             72
Florida         15.4       335             80
Georgia         17.4       211             60
Hawaii           5.3        46             83

Note that in total there are 50 states (this display has been cut off after the 11th row, Hawaii), and that there are four variables. Have a look at the USArrests data itself, and the associated help file:

?USArrests
summary(USArrests)
USArrests

1.4 Graphical and dynamic graphical methods

1.4.1 Chernoff’s Faces

One of the more charismatic ways of presenting multivariate data was proposed by Chernoff, H. (1973), “The use of faces to represent points in k-dimensional space graphically”, JASA, 68, pp 361-368 (see www.wiwi.uni-bielefeld.de/~wolf/ for the R code to create these). If you have loaded the mvmmisc.R file, you can get the
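If the mvmmisc.R file is not to hand, a similar display can be produced with the faces() function from the aplpack package on CRAN. The sketch below is an assumed alternative, not the code the passage refers to.

## Chernoff's faces via the aplpack package (assumed installed,
## e.g. install.packages("aplpack")); not the mvmmisc.R code above.
library(aplpack)

data(USArrests)
## One face per state for the first 11 states; each variable in the
## data matrix is mapped to a different facial feature.
faces(USArrests[1:11, ])

States with similar arrest and urbanisation profiles should then end up with similar-looking faces, which is what makes the display useful for spotting groups of alike observations.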

