an introduction to Principal Component Analysis (PCA)
abstract

Principal component analysis (PCA) is a technique that is useful for the compression and classification of data. The purpose is to reduce the dimensionality of a data set (sample) by finding a new set of variables, smaller than the original set of variables, that nonetheless retains most of the sample's information. By information we mean the variation present in the sample, given by the correlations between the original variables. The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.
overview

- geometric picture of PCs
- algebraic definition and derivation of PCs
- usage of PCA
- astronomical application
Geometric picture of principal components (PCs)

A sample of n observations in the 2-D space $\mathbf{x} = (x_1, x_2)$.

Goal: to account for the variation in a sample in as few variables as possible, to some required accuracy.
Geometric picture of principal components (PCs)

The 1st PC $z_1$ is a minimum distance fit to a line in $\mathbf{x}$ space. The 2nd PC $z_2$ is a minimum distance fit to a line in the plane perpendicular to the 1st PC.

PCs are a series of linear least squares fits to a sample, each orthogonal to all the previous.
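The minimum-distance picture above can be checked numerically. Below is a minimal numpy sketch on hypothetical 2-D data: the 1st PC direction (the leading eigenvector of the sample covariance matrix) gives a smaller total squared perpendicular distance than any other line through the mean.

```python
import numpy as np

# A hypothetical 2-D sample of n observations with correlated variables.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
x -= x.mean(axis=0)  # measure each variable about its sample mean

# The 1st PC direction is the leading eigenvector of the sample covariance.
S = x.T @ x / (n - 1)
eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
a1 = eigvecs[:, -1]                   # direction of the 1st PC

def sq_perp_dist(u):
    """Total squared perpendicular distance from the sample to a line along u."""
    proj = x @ u
    return np.sum((x - np.outer(proj, u)) ** 2)

# Minimum-distance property: the 1st PC line beats an arbitrary unit direction.
u_other = np.array([1.0, 0.0])
assert sq_perp_dist(a1) <= sq_perp_dist(u_other)
```

Minimizing perpendicular distance is equivalent to maximizing the variance of the projection, which is why the least-squares picture and the variance-maximization definition that follows agree.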
Algebraic definition of PCs

Given a sample of n observations on a vector of p variables

$\mathbf{x} = (x_1, x_2, \ldots, x_p)$

define the first principal component of the sample by the linear transformation

$z_1 = \mathbf{a}_1^T \mathbf{x} = \sum_{i=1}^{p} a_{i1} x_i$

where the vector $\mathbf{a}_1 = (a_{11}, a_{21}, \ldots, a_{p1})$ is chosen such that $\mathrm{var}[z_1]$ is maximum.
Algebraic definition of PCs

Likewise, define the kth PC of the sample by the linear transformation

$z_k = \mathbf{a}_k^T \mathbf{x}, \qquad k = 1, \ldots, p$

where the vector $\mathbf{a}_k = (a_{1k}, a_{2k}, \ldots, a_{pk})$ is chosen such that $\mathrm{var}[z_k]$ is maximum, subject to $\mathrm{cov}[z_k, z_l] = 0$ for $k > l \geq 1$, and to $\mathbf{a}_k^T \mathbf{a}_k = 1$.
Algebraic derivation of coefficient vectors

To find $\mathbf{a}_1$, first note that

$\mathrm{var}[z_1] = \mathbf{a}_1^T \mathbf{S} \mathbf{a}_1$

where $\mathbf{S}$ is the covariance matrix for the variables $\mathbf{x}$.
Algebraic derivation of coefficient vectors

To find $\mathbf{a}_1$: maximize $\mathrm{var}[z_1] = \mathbf{a}_1^T \mathbf{S} \mathbf{a}_1$ subject to $\mathbf{a}_1^T \mathbf{a}_1 = 1$.

Let $\lambda$ be a Lagrange multiplier; then maximize

$\mathbf{a}_1^T \mathbf{S} \mathbf{a}_1 - \lambda (\mathbf{a}_1^T \mathbf{a}_1 - 1)$

by differentiating with respect to $\mathbf{a}_1$:

$\mathbf{S} \mathbf{a}_1 - \lambda \mathbf{a}_1 = 0$

therefore $\mathbf{a}_1$ is an eigenvector of $\mathbf{S}$ corresponding to eigenvalue $\lambda$.
Algebraic derivation of $\mathbf{a}_1$

We have maximized

$\mathrm{var}[z_1] = \mathbf{a}_1^T \mathbf{S} \mathbf{a}_1 = \lambda \mathbf{a}_1^T \mathbf{a}_1 = \lambda$

So $\lambda = \lambda_1$ is the largest eigenvalue of $\mathbf{S}$. The first PC $z_1$ retains the greatest amount of variation in the sample.
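The result above, that the leading eigenvector maximizes the Rayleigh quotient $\mathbf{a}^T \mathbf{S} \mathbf{a}$ over unit vectors, can be spot-checked numerically. A minimal sketch on hypothetical data:

```python
import numpy as np

# Hypothetical sample: 200 observations on p = 4 correlated variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X -= X.mean(axis=0)
S = X.T @ X / (X.shape[0] - 1)  # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)   # ascending eigenvalues
lam1, a1 = eigvals[-1], eigvecs[:, -1]

# var(z1) = a1^T S a1 equals the largest eigenvalue lambda_1.
assert np.isclose(a1 @ S @ a1, lam1)

# No random unit vector attains a larger variance.
for _ in range(1000):
    a = rng.normal(size=4)
    a /= np.linalg.norm(a)
    assert a @ S @ a <= lam1 + 1e-12
```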
Algebraic derivation of coefficient vectors

To find the next coefficient vector $\mathbf{a}_2$: maximize $\mathrm{var}[z_2] = \mathbf{a}_2^T \mathbf{S} \mathbf{a}_2$ subject to $\mathrm{cov}[z_2, z_1] = 0$ and to $\mathbf{a}_2^T \mathbf{a}_2 = 1$.

First note that

$\mathrm{cov}[z_2, z_1] = \mathbf{a}_2^T \mathbf{S} \mathbf{a}_1 = \lambda_1 \mathbf{a}_2^T \mathbf{a}_1$

then let $\lambda$ and $\varphi$ be Lagrange multipliers, and maximize

$\mathbf{a}_2^T \mathbf{S} \mathbf{a}_2 - \lambda (\mathbf{a}_2^T \mathbf{a}_2 - 1) - \varphi\, \mathbf{a}_2^T \mathbf{a}_1$
Algebraic derivation of coefficient vectors

We find that $\mathbf{a}_2$ is also an eigenvector of $\mathbf{S}$, whose eigenvalue $\lambda = \lambda_2$ is the second largest.

In general:

$\mathrm{var}[z_k] = \mathbf{a}_k^T \mathbf{S} \mathbf{a}_k = \lambda_k$

The kth largest eigenvalue of $\mathbf{S}$ is the variance of the kth PC. The kth PC $z_k$ retains the kth greatest fraction of the variation in the sample.
Algebraic formulation of PCA

Given a sample of n observations on a vector of p variables $\mathbf{x}$, define a vector of p PCs

$\mathbf{z} = \mathbf{A}^T \mathbf{x}$

where $\mathbf{A}$ is an orthogonal p x p matrix whose kth column is the kth eigenvector $\mathbf{a}_k$ of $\mathbf{S}$.

Then

$\mathbf{\Lambda} = \mathbf{A}^T \mathbf{S} \mathbf{A}$

is the covariance matrix of the PCs, $\mathbf{\Lambda}$ being diagonal with elements $\lambda_k$.
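The claim that the PCs are uncorrelated, i.e. that their covariance matrix is diagonal with the eigenvalues on the diagonal, can be verified directly. A minimal numpy sketch (data are hypothetical):

```python
import numpy as np

# Hypothetical sample: 300 observations on p = 3 correlated variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 3))
X -= X.mean(axis=0)
S = X.T @ X / (X.shape[0] - 1)

eigvals, A = np.linalg.eigh(S)  # columns of A are the eigenvectors a_k
Z = X @ A                       # PC scores: z = A^T x for each observation

# Covariance matrix of the PCs is diagonal, with the eigenvalues on the diagonal.
Sz = Z.T @ Z / (Z.shape[0] - 1)
assert np.allclose(Sz, np.diag(eigvals), atol=1e-8)
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order; to match the convention in the text (PCs ordered by decreasing variance) the columns of `A` would be reversed.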
usage of PCA: Probability distribution for sample PCs

If (i) the n observations of $\mathbf{x}$ in the sample are independent, and (ii) $\mathbf{x}$ is drawn from an underlying population that follows a p-variate normal (Gaussian) distribution with known covariance matrix $\mathbf{\Sigma}$, then

$(n-1)\,\mathbf{S} \sim W_p(\mathbf{\Sigma}, n-1)$

where $W_p$ is the Wishart distribution; else utilize a bootstrap approximation.
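When the Gaussian assumption is not tenable, the sampling distribution of the eigenvalues can be approximated by resampling, as the slide suggests. A minimal bootstrap sketch (the data, resample count, and confidence level are illustrative choices, not from the source):

```python
import numpy as np

# Hypothetical sample: 200 observations on p = 3 correlated variables.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.2, 0.1, 0.5]])

def top_eigenvalue(sample):
    """Largest eigenvalue of the sample covariance matrix."""
    s = sample - sample.mean(axis=0)
    S = s.T @ s / (s.shape[0] - 1)
    return np.linalg.eigvalsh(S)[-1]

# Bootstrap: resample observations with replacement, recompute lambda_1.
boot = np.array([
    top_eigenvalue(X[rng.integers(0, len(X), size=len(X))])
    for _ in range(500)
])
lo, hi = np.percentile(boot, [2.5, 97.5])  # approximate 95% interval for lambda_1
```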
usage of PCA: Probability distribution for sample PCs

If (i) $\mathbf{S}$ follows a Wishart distribution, and (ii) the population eigenvalues $\tilde{\lambda}_k$ are all distinct, then the following results hold as $n \to \infty$:

- all the $\lambda_k$ are independent of all the $\mathbf{a}_k$
- the $\lambda_k$ and the $\mathbf{a}_k$ are jointly normally distributed, with

$E(\lambda_k) = \tilde{\lambda}_k, \qquad \mathrm{var}(\lambda_k) \simeq \frac{2 \tilde{\lambda}_k^2}{n}, \qquad E(\mathbf{a}_k) = \tilde{\mathbf{a}}_k$

(a tilde denotes a population quantity).
usage of PCA: Inference about population PCs

If $\mathbf{x}$ follows a p-variate normal distribution, then analytic expressions exist* for:

- MLE's of $\tilde{\lambda}_k$, $\tilde{\mathbf{a}}_k$, and $\tilde{\mathbf{\Sigma}}$
- confidence intervals for $\tilde{\lambda}_k$ and $\tilde{\mathbf{a}}_k$
- hypothesis testing for $\tilde{\lambda}_k$ and $\tilde{\mathbf{a}}_k$

else bootstrap and jackknife approximations exist*

*see references, esp. Jolliffe
usage of PCA: Practical computation of PCs

In general it is useful to define standardized variables by

$x_k' = \dfrac{x_k}{\sqrt{\mathrm{var}(x_k)}}$

If the $x_k$ are each measured about their sample mean, then the covariance matrix of $\mathbf{x}' = (x_1', x_2', \ldots, x_p')$ will be equal to the correlation matrix of $\mathbf{x}$, and the PCs will be dimensionless.
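The equivalence between the covariance matrix of the standardized variables and the correlation matrix of the originals is easy to confirm. A minimal sketch on hypothetical data with very different variable scales:

```python
import numpy as np

# Hypothetical sample: three variables with very different scales.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])
X -= X.mean(axis=0)              # measure about the sample means
Xs = X / X.std(axis=0, ddof=1)   # standardized variables x'_k

# Covariance matrix of the standardized variables = correlation matrix of x.
S_std = Xs.T @ Xs / (X.shape[0] - 1)
assert np.allclose(S_std, np.corrcoef(X, rowvar=False))
assert np.allclose(np.diag(S_std), 1.0)
```

Standardizing matters in practice: without it, a variable measured in small units (hence large numbers) can dominate the leading PCs purely through its scale.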
usage of PCA: Practical computation of PCs

Given a sample of n observations on a vector of p variables $\mathbf{x}$ (each measured about its sample mean), compute the covariance matrix

$\mathbf{S} = \dfrac{1}{n-1} \mathbf{X}^T \mathbf{X}$

where $\mathbf{X}$ is the n x p matrix whose ith row is the ith observation $\mathbf{x}_i^T$.

Then compute the n x p matrix

$\mathbf{Z} = \mathbf{X} \mathbf{A}$

whose ith row is the PC score $\mathbf{z}_i^T$ for the ith observation.
usage of PCA: Practical computation of PCs

Write

$\mathbf{X} = \mathbf{Z} \mathbf{A}^T$

to decompose each observation into PCs:

$\mathbf{x}_i = \sum_{k=1}^{p} z_{ik} \mathbf{a}_k$
usage of PCA: Data compression

Because the kth PC retains the kth greatest fraction of the variation, we can approximate each observation by truncating the sum at the first m < p PCs:

$\hat{\mathbf{x}}_i = \sum_{k=1}^{m} z_{ik} \mathbf{a}_k$
usage of PCA: Data compression

Reduce the dimensionality of the data from p to m < p by approximating

$\mathbf{X} \approx \hat{\mathbf{X}} = \mathbf{Z}_m \mathbf{A}_m^T$

where $\mathbf{Z}_m$ is the n x m portion of $\mathbf{Z}$, and $\mathbf{A}_m$ is the p x m portion of $\mathbf{A}$.
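The compression step can be sketched end to end: compute the PCs, keep the first m, and check that the discarded variance equals the sum of the trailing eigenvalues. A minimal numpy example on hypothetical data (n, p, m are illustrative):

```python
import numpy as np

# Hypothetical sample: n observations on p variables, compressed to m < p.
rng = np.random.default_rng(5)
n, p, m = 200, 5, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X -= X.mean(axis=0)

S = X.T @ X / (n - 1)
eigvals, A = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]  # PCs sorted by decreasing variance

Z = X @ A                                 # full PC scores (n x p)
Zm, Am = Z[:, :m], A[:, :m]               # keep only the first m PCs
X_hat = Zm @ Am.T                         # rank-m approximation of X

# The variance discarded by truncation is the sum of the trailing eigenvalues.
err = np.sum((X - X_hat) ** 2) / (n - 1)
assert np.isclose(err, eigvals[m:].sum())
```

Storing `Zm` and `Am` requires m(n + p) numbers instead of np, which is the sense in which PCA compresses the data.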
astronomical application: PCs for elliptical galaxies

Rotating to the PC in $B_T$ - $\Sigma$ space improves the Faber-Jackson relation as a distance indicator (Dressler, et al. 1987).
astronomical application: Eigenspectra (KL transform)

(Connolly, et al. 1995)
references

Connolly, Szalay, et al., "Spectral Classification of Galaxies: An Orthogonal Approach", AJ, 110, 1071-1082, 1995.
Dressler, et al., "Spectroscopy and Photometry of Elliptical Galaxies. I. A New Distance Estimator", ApJ, 313, 42-58, 1987.
Efstathiou, G., and Fall, S.M., "Multivariate analysis of elliptical galaxies", MNRAS, 206, 453-464, 1984.
Johnston, D.E., et al., "SDSS J0903 5028: A New Gravitational Lens", AJ, 126, 2281-2290, 2003.
Jolliffe, Ian T., 2002, Principal Component Analysis (Springer-Verlag New York, Secaucus, NJ).
Lupton, R., 1993, Statistics In Theory and Practice (Princeton University Press, Princeton, NJ).
Murtagh, F., and Heck, A., Multivariate Data Analysis (D. Reidel Publishing Company, Dordrecht, Holland).
Yip, C.W., and Szalay, A.S., et al., "Distributions of Galaxy Spectral Types in the SDSS", AJ, 128, 585-609, 2004.