Use R!

Series Editors:
Robert Gentleman
Kurt Hornik
Giovanni Parmigiani

For other titles published in this series, go to
http://www.springer.com/series/6991
Brian Everitt
Torsten Hothorn

An Introduction to Applied Multivariate Analysis with R
Brian Everitt
Professor Emeritus
King's College
London, SE5 8AF
UK
brian.everitt@btopenworld.com

Torsten Hothorn
Institut für Statistik
Ludwig-Maximilians-Universität München
Ludwigstr. 33
80539 München
Germany

Series Editors:

Robert Gentleman
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Avenue, N. M2-B876
Seattle, Washington 98109
USA

Kurt Hornik
Department of Statistics and Mathematics
Wirtschaftsuniversität Wien
Augasse 2-6
A-1090 Wien
Austria

Giovanni Parmigiani
The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins University
550 North Broadway
Baltimore, MD 21205-2011
USA

ISBN 978-1-4419-9649-7        e-ISBN 978-1-4419-9650-3
DOI 10.1007/978-1-4419-9650-3
Springer New York Dordrecht Heidelberg London

Library of Congress Control Number: 2011926793

© Springer Science+Business Media, LLC 2011
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)
To our wives, Mary-Elizabeth and Carolin.
Preface

The majority of data sets collected by researchers in all disciplines are multivariate, meaning that several measurements, observations, or recordings are taken on each of the units in the data set. These units might be human subjects, archaeological artifacts, countries, or a vast variety of other things. In a few cases, it may be sensible to isolate each variable and study it separately, but in most instances all the variables need to be examined simultaneously in order to fully grasp the structure and key features of the data. For this purpose, one or another method of multivariate analysis might be helpful, and it is with such methods that this book is largely concerned. Multivariate analysis includes methods both for describing and exploring such data and for making formal inferences about them. The aim of all the techniques is, in a general sense, to display or extract the signal in the data in the presence of noise and to find out what the data show us in the midst of their apparent chaos.

The computations involved in applying most multivariate techniques are considerable, and their routine use requires a suitable software package. In addition, most analyses of multivariate data should involve the construction of appropriate graphs and diagrams, and this will also need to be carried out using the same package. R is a statistical computing environment that is powerful, flexible, and, in addition, has excellent graphical facilities. It is for these reasons that it is the use of R for multivariate analysis that is illustrated in this book.

In this book, we concentrate on what might be termed the "core" or "classical" multivariate methodology, although mention will be made of recent developments where these are considered relevant and useful. But there is an area of multivariate statistics that we have omitted from this book, and that is multivariate analysis of variance (MANOVA) and related techniques such as Fisher's linear discriminant function (LDF).
There are a variety of reasons for this omission. First, we are not convinced that MANOVA is now of much more than historical interest; researchers may occasionally pay lip service to using the technique, but in most cases it really is no more than this. They quickly
move on to looking at the results for individual variables. And MANOVA for repeated measures has been largely superseded by the models that we shall describe in Chapter 8. Second, a classification technique such as LDF needs to be considered in the context of modern classification algorithms, and these cannot be covered in an introductory book such as this.

Some brief details of the theory behind each technique described are given, but the main concern of each chapter is the correct application of the methods so as to extract as much information as possible from the data at hand, particularly as some type of graphical representation, via the R software.

The book is aimed at students in applied statistics courses, both undergraduate and post-graduate, who have attended a good introductory course in statistics that covered hypothesis testing, confidence intervals, simple regression and correlation, analysis of variance, and basic maximum likelihood estimation. We also assume that readers will know some simple matrix algebra, including the manipulation of matrices and vectors and the concepts of the inverse and rank of a matrix. In addition, we assume that readers will have some familiarity with R at the level of, say, Dalgaard (2002). In addition to such a student readership, we hope that many applied statisticians dealing with multivariate data will find something of interest in the eight chapters of our book.

Throughout the book, we give many examples of R code used to apply the multivariate techniques to multivariate data. Samples of code that could be entered interactively at the R command line are formatted as follows:

R> library("MVA")

Here, R> denotes the prompt sign from the R command line, and the user enters everything else. The symbol + indicates additional lines, which are appropriately indented.
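The matrix-algebra prerequisites just mentioned are easy to try out interactively; the sketch below (using an arbitrary 2 x 2 matrix chosen purely for illustration, not one taken from the book) shows matrix multiplication, the inverse, and the rank:

```r
# An arbitrary small, nonsingular matrix used only for illustration
A <- matrix(c(2, 1, 1, 3), nrow = 2)

# solve() with a single matrix argument returns the inverse of that matrix
A_inv <- solve(A)

# Multiplying a matrix by its inverse recovers the identity matrix
round(A %*% A_inv, 10)

# The rank of a matrix is available from its QR decomposition
qr(A)$rank
```

Readers for whom any of these operations are unfamiliar may wish to review them before tackling the later chapters.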
Finally, output produced by function calls is shown below the associated code:

R> rnorm(10)
 [1]  1.8808  0.2572 -0.3412  0.4081  0.4344  0.7003  1.8944
 [8] -0.2993 -0.7355  0.8960

In this book, we use several R packages to access different example data sets (many of them contained in the package HSAUR2), standard functions for the general parametric analyses, and the MVA package to perform analyses. All of the packages used in this book are available at the Comprehensive R Archive Network (CRAN), which can be accessed from http://CRAN.R-project.org. The source code for the analyses presented in this book is available from the MVA package. A demo containing the R code to reproduce the individual results is available for each chapter by invoking

R> library("MVA")
R> demo("Ch-MVA")   ### Introduction to Multivariate Analysis
R> demo("Ch-Viz")   ### Visualization
R> demo("Ch-PCA")   ### Principal Components Analysis
R> demo("Ch-EFA")   ### Exploratory Factor Analysis
R> demo("Ch-MDS")   ### Multidimensional Scaling
R> demo("Ch-CA")    ### Cluster Analysis
R> demo("Ch-SEM")   ### Structural Equation Models
R> demo("Ch-LME")   ### Linear Mixed-Effects Models

Thanks are due to Lisa Möst, BSc., for help with data processing and LaTeX typesetting, to the copy editor for many helpful corrections, and to John Kimmel, for all his support and patience during the writing of the book.

January 2011
Brian S. Everitt, London
Torsten Hothorn, München
Contents

Preface ..... vii

1 Multivariate Data and Multivariate Analysis ..... 1
  1.1 Introduction ..... 1
  1.2 A brief history of the development of multivariate analysis ..... 3
  1.3 Types of variables and the possible problem of missing values ..... 4
    1.3.1 Missing values ..... 5
  1.4 Some multivariate data sets ..... 7
  1.5 Covariances, correlations, and distances ..... 12
    1.5.1 Covariances ..... 12
    1.5.2 Correlations ..... 14
    1.5.3 Distances ..... 14
  1.6 The multivariate normal density function ..... 15
  1.7 Summary ..... 23
  1.8 Exercises ..... 23

2 Looking at Multivariate Data: Visualisation ..... 25
  2.1 Introduction ..... 25
  2.2 The scatterplot ..... 26
    2.2.1 The bivariate boxplot ..... 28
    2.2.2 The convex hull of bivariate data ..... 32
    2.2.3 The chi-plot ..... 34
  2.3 The bubble and other glyph plots ..... 34
  2.4 The scatterplot matrix ..... 39
  2.5 Enhancing the scatterplot with estimated bivariate densities ..... 42
    2.5.1 Kernel density estimators ..... 42
  2.6 Three-dimensional plots ..... 47
  2.7 Trellis graphics ..... 50
  2.8 Stalactite plots ..... 53
  2.9 Summary ..... 56
  2.10 Exercises ..... 60

3 Principal Components Analysis ..... 61
  3.1 Introduction ..... 61
  3.2 Principal components analysis (PCA) ..... 61
  3.3 Finding the sample principal components ..... 63
  3.4 Should principal components be extracted from the covariance or the correlation matrix? ..... 65
  3.5 Principal components of bivariate data with correlation coefficient r ..... 68
  3.6 Rescaling the principal components ..... 70
  3.7 How the principal components predict the observed covariance matrix ..... 70
  3.8 Choosing the number of components ..... 71
  3.9 Calculating principal components scores ..... 72
  3.10 Some examples of the application of principal components analysis ..... 74
    3.10.1 Head lengths of first and second sons ..... 74
    3.10.2 Olympic heptathlon results ..... 78
    3.10.3 Air pollution in US cities ..... 86
  3.11 The biplot ..... 92
  3.12 Sample size for principal components analysis ..... 93
  3.13 Canonical correlation analysis ..... 94
    3.13.1 Head measurements ..... 96
    3.13.2 Health and personality ..... 99
  3.14 Summary ..... 101
  3.15 Exercises ..... 102

4 Multidimensional Scaling ..... 105
  4.1 Introduction ..... 105
  4.2 Models for proximity data ..... 105
  4.3 Spatial models for proximities: Multidimensional scaling ..... 106
  4.4 Classical multidimensional scaling ..... 106
    4.4.1 Classical multidimensional scaling: Technical details ..... 107
    4.4.2 Examples of classical multidimensional scaling ..... 110
  4.5 Non-metric multidimensional scaling ..... 121
    4.5.1 House of Representatives voting ..... 123
    4.5.2 Judgements of World War II leaders ..... 124
  4.6 Correspondence analysis ..... 127
    4.6.1 Teenage relationships ..... 130
  4.7 Summary ..... 131
  4.8 Exercises ..... 132

5 Exploratory Factor Analysis ..... 135
  5.1 Introduction ..... 135
  5.2 A simple example of a factor analysis model ..... 136
  5.3 The k-factor analysis model ..... 137
  5.4 Scale invariance of the k-factor model ..... 138
  5.5 Estimating the parameters in the k-factor analysis model ..... 139
    5.5.1 Principal factor analysis ..... 141
    5.5.2 Maximum likelihood factor analysis ..... 142
  5.6 Estimating the number of factors ..... 142
  5.7 Factor rotation ..... 143
  5.8 Estimating factor scores ..... 147
  5.9 Two examples of exploratory factor analysis ..... 148
    5.9.1 Expectations of life ..... 148
    5.9.2 Drug use by American college students ..... 151
  5.10 Factor analysis and principal components analysis compared ..... 157
  5.11 Summary ..... 159
  5.12 Exercises ..... 159

6 Cluster Analysis ..... 163
  6.1 Introduction ..... 163
  6.2 Cluster analysis ..... 165
  6.3 Agglomerative hierarchical clustering ..... 166
    6.3.1 Clustering jet fighters ..... 171
  6.4 K-means clustering ..... 175
    6.4.1 Clustering the states of the USA on the basis of their crime rate profiles ..... 176
    6.4.2 Clustering Romano-British pottery ..... 180
  6.5 Model-based clustering ..... 183
    6.5.1 Finite mixture densities ..... 186
    6.5.2 Maximum likelihood estimation in a finite mixture density with multivariate normal components ..... 187
  6.6 Displaying clustering solutions graphically ..... 191
  6.7 Summary ..... 197
  6.8 Exercises ..... 200

7 Confirmatory Factor Analysis and Structural Equation Models ..... 201
  7.1 Introduction ..... 201
  7.2 Estimation, identification, and assessing fit for confirmatory factor and structural equation models ..... 202
    7.2.1 Estimation ..... 202
    7.2.2 Identification ..... 203
    7.2.3 Assessing the fit of a model ..... 204
  7.3 Confirmatory factor analysis models ..... 206
    7.3.1 Ability and aspiration ..... 206
    7.3.2 A confirmatory factor analysis model for drug use ..... 211
  7.4 Structural equation models ..... 216
    7.4.1 Stability of alienation ..... 216
  7.5 Summary ..... 222
  7.6 Exercises ..... 223

8 The Analysis of Repeated Measures Data ..... 225
  8.1 Introduction ..... 225
  8.2 Linear mixed-effects models for repeated measures data ..... 232
    8.2.1 Random intercept and random intercept and slope models for the timber slippage data ..... 233
    8.2.2 Applying the random intercept and the random intercept and slope models to the timber slippage data ..... 235
    8.2.3 Fitting random-effect models to the glucose challenge data ..... 240
  8.3 Prediction of random effects ..... 247
  8.4 Dropouts in longitudinal data ..... 248
  8.5 Summary ..... 257
  8.6 Exercises ..... 257

References ..... 259

Index ..... 271
1 Multivariate Data and Multivariate Analysis

1.1 Introduction

Multivariate data arise when researchers record the values of several random variables on a number of subjects or objects or perhaps one of a variety of other things (we will use the general term "units") in which they are interested, leading to a vector-valued or multidimensional observation for each. Such data are collected in a wide range of disciplines, and indeed it is probably reasonable to claim that the majority of data sets met in practice are multivariate. In some studies, the variables are chosen b