The Prediction Error In CLS And PLS: The Importance Of Feature Selection Prior To Multivariate Calibration


JOURNAL OF CHEMOMETRICS
J. Chemometrics 2005; 19: 107–118
Published online 22 September 2005 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cem.915

The prediction error in CLS and PLS: the importance of feature selection prior to multivariate calibration

Boaz Nadler* and Ronald R. Coifman
Department of Mathematics, Yale University, New Haven, CT 06520, USA
*Correspondence to: B. Nadler, Department of Mathematics, Yale University, New Haven, CT 06520, USA. E-mail: boaz.nadler@yale.edu

Received 18 September 2004; Revised 25 June 2005; Accepted 25 June 2005

Classical least squares (CLS) and partial least squares (PLS) are two common multivariate regression algorithms in chemometrics. This paper presents an asymptotically exact mathematical analysis of the mean squared error of prediction of CLS and PLS under the linear mixture model commonly assumed in spectroscopy. For CLS regression with a very large calibration set the root mean squared error is approximately equal to the noise per wavelength divided by the length of the net analyte signal vector. It is shown, however, that for a finite training set with n samples in p dimensions there are additional error terms that depend on σ²p²/n², where σ is the noise level per co-ordinate. Therefore in the 'large p–small n' regime, common in spectroscopy, these terms can be quite large and even dominate the overall prediction error. It is demonstrated both theoretically and by simulations that dimensional reduction of the input data via their compact representation with a few features, selected for example by adaptive wavelet compression, can substantially decrease these effects and recover the asymptotic error. This analysis provides a theoretical justification for the need to perform feature selection (dimensional reduction) of the input data prior to application of multivariate regression algorithms. Copyright © 2005 John Wiley & Sons, Ltd.

KEYWORDS: classical least squares; partial least squares; prediction error; dimensional reduction; feature selection

1. INTRODUCTION

Multivariate regression problems arise in the analysis of data in diverse applications. When the number of samples, n, is (much) larger than the number of regressors, p, and the corresponding matrices are well conditioned, standard methods such as ordinary least squares (OLS) can typically be applied. However, in many scientific fields, including chemometrics in general and spectroscopy in particular, the common situation is that the number of samples is much smaller than the number of variables (n ≪ p), in which case ordinary least squares is indeterminate and thus inapplicable. The remarkable fact that predictions are possible even in this setting stems from the (sometimes hidden) property that, although the data are presented in a high-dimensional space, they actually have a much lower intrinsic dimensionality d ≪ n. For example, in a spectroscopic measurement of a system with three components at 1000 different wavelengths, although the measured spectrum is represented in a 1000-dimensional space, it is typically assumed to be in a three-dimensional subspace (or at most a five-dimensional subspace if the measuring device adds a random baseline shift and a random slope to the signal).

In this setting, for which standard methods such as OLS fail, classical least squares (CLS) and partial least squares (PLS) are two common and very successful algorithms applied in practice [1–4].
These methods are sometimes viewed as performing dimensional reduction, since in both CLS and PLS the data are projected onto a few data-dependent directions and regression is performed in this lower-dimensional subspace. The two methods differ in the way this subspace is defined and in the regression method employed in it. CLS, also known as the K-matrix method, is a direct method that requires full knowledge of all components in the training samples of the measured system and is thus typically applicable only to very simple systems [3]. Recently, however, modifications of the algorithm to include unmodeled interferences have been suggested [5,6], thus possibly extending its applicability. PLS, on the other hand, is an indirect method that requires only knowledge of the concentration of the substance of interest, is thus more widely applicable than CLS and is the de facto standard calibration method in spectroscopy [4].

An important theoretical and practical question is what is the expected performance of these algorithms on future samples given calibration on a finite and noisy training set, and how does this performance compare with that of competing algorithms such as principal component regression (PCR) and ridge regression (RR)?

In the chemometrics literature this problem was tackled mainly by direct application of CLS, PLS and competing algorithms, both on real data sets and on simulated data sets that follow a linear mixture model (see e.g. References [7–10]). In their seminal paper, Thomas and Haaland [9] investigated the effects of eight different parameters on the prediction error of CLS, PLS and PCR by extensive Monte Carlo simulation studies. Wentzell and Vega-Montoto [10] also made an extensive numerical comparison of PLS and PCR with simulated data containing many components. The main conclusion of these studies is that most algorithms have a similar performance, with each algorithm having its own regime of superiority, so that no one algorithm is everywhere optimal. On the theoretical front, various works have attempted to estimate the prediction error for specific data sets using various approximations for error propagation [11–14], but no explicit formulae for the linear mixture case were given.

In the statistical literature the subject of multivariate calibration has been addressed in many works [15–20]. Much effort was put forth to elucidate the PLS algorithm from a statistical point of view [21–24], although a theory for the performance of PLS under the linear mixture model with a finite and noisy training set was not considered. In terms of theoretical formulae for the expected mean squared error of prediction, most attention has been devoted to the study of other multivariate regression algorithms, such as the generalized least squares and best linear predictor algorithms, and not of the more common CLS and PLS algorithms. In addition, most works consider only the case of more observations than variables, n > p, since these algorithms become indeterminate when n < p. To overcome this indeterminacy, minimal length regressors were proposed [18,19]. Theoretical work on the mean squared error of prediction was mainly done on the univariate case (only one component in one dimension), where both asymptotic and exact expressions for the root mean squared error of prediction (RMSEP) as well as confidence regions have been derived for various regressors [16,20,25].

Although both CLS and PLS perform a dimensional reduction, it is known empirically that an initial dimensional reduction of the input data prior to application of these algorithms is often very beneficial in practice. Most work on this type of feature selection prior to application of multivariate algorithms has focused on methods to optimally select a subset of the original variables (wavelength selection). Both Xu and Schechter [26] and Spiegelman et al. [27] gave a theoretical justification for wavelength selection based on an approximate analysis of the uncertainty error in the computation of the regression vector under a linear mixture model.

In this paper we extend these results and provide a mathematical analysis of the expected RMSEP for both CLS and PLS under the linear mixture model. For CLS we show that, although the asymptotic error for a very large training set is given by the noise level divided by the length of the net analyte signal vector [11,28], for a finite training set of n noisy samples there are additional correction terms of order O(1/n), O(1/n²), etc. The interesting property we find is that, although the 1/n term is typically multiplied by an O(1) coefficient, the 1/n² term is multiplied by σ²p², where σ is the noise per co-ordinate and p is the dimensionality of the input data.
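For a sense of scale, take illustrative values not drawn from the paper, say σ = 0.01, p = 1000 and n = 20: the second-order factor is then σ²p²/n² = (10⁻⁴ · 10⁶)/400 = 0.25, i.e. already a 25% relative contribution to the mean squared error of prediction, and it grows quadratically as p increases or n decreases.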
Therefore in the 'large p–small n' regime, common in spectroscopy involving many more variables than samples, this correction term may actually dominate the overall error. From a statistical point of view these results are not surprising. In classification problems it is well known that the performance of standard classification algorithms, such as Fisher's linear discriminant analysis, is greatly degraded in the 'large p–small n' setting, since there appear correction terms of the form σ²p/n [29,30]. Therefore our results can be viewed as the analogues of these well-known formulae for multivariate calibration problems.

Indeed, many papers in the chemometrics literature show empirically that an initial dimensional reduction prior to application of PLS, typically achieved in practice by wavelength selection, is quite beneficial in decreasing prediction errors. Our error analysis, showing that some error terms are of the form σ²p²/n², provides the theoretical justification for this empirical finding, as also concluded by Spiegelman et al. [27] and Xu and Schechter [26]. However, while both these works (as well as many others) suggest wavelength selection as the method of choice to perform this initial dimensional reduction, in this paper we show mathematically that, for complex systems with many interfering components and lack of specificity at any single wavelength, wavelength selection methods have severe limitations and cannot in general achieve optimal prediction errors. In contrast, we propose to use adaptive wavelet feature selection algorithms [31,32] to perform this initial dimensional reduction, and present some simulation results that show their empirical success in achieving near-optimal prediction errors. Thus our analysis provides a justification and a better theoretical understanding of the role of wavelets as a tool for feature selection prior to multivariate calibration. A survey of the recent literature indeed reveals an increasing use of wavelets in the analysis of spectroscopic signals, with empirical reports that this use decreases (sometimes) the prediction errors of multivariate regression algorithms [33–37].

The paper is organized as follows. In Section 2 we define the probabilistic model of the input data and the multivariate calibration problem. The analysis of CLS and PLS under this model is described in Section 3. The issue of feature selection is described in Section 4. Section 5 presents numerical simulations that verify the results of our analysis. We conclude with a discussion and summary in Section 6. Mathematical proofs appear in the Appendix.
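As a rough illustration of the kind of pipeline discussed above (compress each spectrum with a wavelet transform, keep a handful of strong coefficients, then regress in the reduced space), here is a minimal sketch. It assumes the PyWavelets and scikit-learn packages; the selection rule (largest mean absolute coefficient) and all parameter values are illustrative choices, not the adaptive algorithm of References [31,32].

```python
# Minimal sketch of wavelet compression followed by PLS regression.
# Assumes PyWavelets and scikit-learn; selection rule and parameters are illustrative.
import numpy as np
import pywt
from sklearn.cross_decomposition import PLSRegression

def wavelet_features(X, wavelet="db4", level=4):
    """Decompose each spectrum (row of X) and return the flattened wavelet coefficients."""
    coeffs = [np.concatenate(pywt.wavedec(x, wavelet, level=level)) for x in X]
    return np.vstack(coeffs)

def select_features(W_train, n_keep=20):
    """Indices of the n_keep coefficients with the largest mean absolute value."""
    score = np.mean(np.abs(W_train), axis=0)
    return np.argsort(score)[-n_keep:]

def fit_wavelet_pls(X_train, y_train, n_keep=20, n_components=3):
    """Compress the training spectra, keep a few strong coefficients, fit PLS on them."""
    W = wavelet_features(X_train)
    idx = select_features(W, n_keep)
    pls = PLSRegression(n_components=n_components)
    pls.fit(W[:, idx], y_train)
    return pls, idx

def predict_wavelet_pls(model, X_new):
    """Apply the same compression and feature subset to new spectra, then predict."""
    pls, idx = model
    return pls.predict(wavelet_features(X_new)[:, idx]).ravel()
```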

2. MULTIVARIATE CALIBRATION UNDER THE LINEAR MIXTURE MODEL

2.1. Notation

We denote vectors by boldface lowercase letters, e.g. v, and matrices by bold capital letters, e.g. C. The Euclidean norm of a vector v is denoted ||v|| and its dot product with a vector w is denoted v·w. Random variables are denoted by italic lowercase letters, e.g. u₀ and u₁, while the mean of a random variable u is E{u}. Noisy estimates of noise-free quantities have a hat on top, e.g. v̂ and û.

2.2. The linear mixture model

We consider the standard multivariate calibration problem in spectroscopy, namely the determination of analyte concentration from the absorbance spectrum of a complex multicomponent system, under the following probabilistic setting for the input data. We consider a system containing k different components, denoted u₁, u₂, ..., u_k, where each component u_j is a random variable with mean μ_j and unique spectral response vector v_j ∈ R^p. We denote by C^p the k × k (population) matrix of second moments of these random variables, with entries C^p_{i,j} = E{u_i u_j}. If all the averages μ_j = 0, then C^p is the covariance matrix. Therefore, with some abuse of notation, we sometimes refer to C^p as the covariance matrix.

We assume that C^p is of full rank and that the vectors {v_j}_{j=1}^k are linearly independent in R^p, as otherwise a reduced model with fewer random components can be formulated. Based on Beer's law, we further assume that the noise-free logarithm of the spectrum, denoted x, is linearly related to the components via

x = \sum_{j=1}^{k} u_j v_j    (1)

whereas the measured spectrum is noisy and given by

\tilde{x} = x + \sigma n    (2)

where n is a random noise vector in R^p whose p co-ordinates are independent identically distributed random variables with zero mean and unit variance and σ is a measure of the level of noise. We assume that u₁ is the substance of interest and, without loss of generality, scale all the other interfering components u₂, ..., u_k so that their corresponding spectral responses have unit norm (||v_j|| = 1 for j ≥ 2). This scaling has no effect on the final prediction of u₁.

The basic multivariate calibration problem can be cast as follows. Given a finite training set of n noisy samples {x̃_i, u_i}_{i=1}^n, related via Equations (1) and (2), with u_i = (u_{i,1}, u_{i,2}, ..., u_{i,k}) the vector of components for the ith sample, construct a regression function f: R^p → R to accurately predict u₁ from future samples x̃. Since we assume a linear relation between components and spectra, in this paper we focus on linear regressors of the form

\hat{u}_1 = f(\tilde{x}) = r \cdot \tilde{x}

where r is the constructed regression vector. Note that in this paper we consider models without an intercept and therefore we do not mean center the data. As described below, mean centering, which is a preprocessing step typically employed in practice, does not qualitatively change our results.

Although this paper is written with a focus on chemometric applications, referring to x as the spectrum and u_i as the analyte concentrations, our analysis is general and thus applicable to any other data modeled by Equations (1) and (2). In the statistics literature the linear mixture model (1) is also known as the standard multivariate linear regression model [38], while problems in which the predictor variables are noisy as in Equation (2) are generally termed 'error-in-variables' (EIV) problems.
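As a concrete illustration of the linear mixture model (1)–(2), the following sketch generates a synthetic calibration set of n noisy spectra at p wavelengths for k components. It assumes NumPy; the Gaussian-shaped spectral responses and uniform concentration distributions are illustrative choices, not part of the model itself.

```python
# Minimal sketch of the linear mixture model (1)-(2); the Gaussian-shaped
# spectral responses and uniform concentrations are illustrative choices only.
import numpy as np

def simulate_mixture(n, p, k, sigma, rng):
    """Return noisy spectra X (n x p), concentrations U (n x k) and responses V (k x p)."""
    wavelengths = np.linspace(0.0, 1.0, p)
    centers = rng.uniform(0.2, 0.8, size=k)
    # Spectral responses v_j: smooth Gaussian-shaped peaks, scaled to unit norm.
    V = np.exp(-0.5 * ((wavelengths[None, :] - centers[:, None]) / 0.05) ** 2)
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    U = rng.uniform(0.0, 1.0, size=(n, k))              # concentrations u_{i,j}
    X_clean = U @ V                                      # noise-free spectra, Eq. (1)
    X = X_clean + sigma * rng.standard_normal((n, p))    # measured spectra, Eq. (2)
    return X, U, V

rng = np.random.default_rng(0)
X, U, V = simulate_mixture(n=50, p=1000, k=3, sigma=0.01, rng=rng)
```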
The model (1)–(2) has been used extensively as a benchmark in many simulation studies and in tests of new algorithms [9,10,26,39]. In this paper we present an asymptotic theory for the prediction error of both CLS and PLS on this model. For simple systems with a single component we obtain explicit formulae, asymptotically exact in the limit of small noise, for the expected mean squared error of prediction as a function of the number of training samples, n, the noise level σ and the dimension p of the signals. Although for complex multicomponent systems the explicit computation of the different constants is essentially algebraically intractable, the prediction error has similar qualitative features as in the case of a single-component system, where an explicit formula is available.

3. THE EXPECTED PREDICTION ERROR

3.1. Classical least squares

For the paper to be reasonably self-contained, we first briefly describe the steps in the classical least squares algorithm. Given a finite training set {x̃_i, u_i}_{i=1}^n, in CLS we first compute estimates {v̂_j} for the (unknown) spectral responses {v_j} by least squares minimization:

\min_{\{v_j\}} \sum_{i=1}^{n} \Big\| \tilde{x}_i - \sum_{j=1}^{k} u_{i,j} v_j \Big\|^2

The solution is

(\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_k)^T = C^{-1} \big( E\{\tilde{x} u_1\}, E\{\tilde{x} u_2\}, \ldots, E\{\tilde{x} u_k\} \big)^T    (3)

where C is the k × k matrix of second moments of the k components u₁, ..., u_k in the training set, assumed to be of full rank. We denote by V̂ the k × k matrix of spectral interferences, with entries V̂_{i,j} = v̂_i · v̂_j. Then the regression vectors computed by CLS for the k different components are given by

(r_1, r_2, \ldots, r_k)^T = \hat{V}^{-1} (\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_k)^T    (4)

Finally, prediction of u₁ for new spectra x̃ is given by

\hat{u}_1 = \tilde{x} \cdot r_1

The question considered in this paper is how well û₁ approximates the unknown value u₁, and specifically what can be said about the mean squared error of prediction E{(û₁ − u₁)²}, when the regression vector r₁ is constructed from a finite and noisy training set.
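Expressed in code, the CLS calibration just described amounts to two small linear solves; the sketch below follows Equations (3) and (4) step by step. It assumes NumPy, and the data in the usage lines are randomly generated for illustration only.

```python
# Schematic CLS calibration following Eqs. (3)-(4): estimate the spectral
# responses by least squares, then form the regression vectors from the
# interference matrix. Assumes NumPy; all data below are illustrative.
import numpy as np

def cls_fit(X, U):
    """CLS regression vectors (one row per component) from spectra X (n x p) and concentrations U (n x k)."""
    n = X.shape[0]
    C = (U.T @ U) / n                 # k x k second-moment matrix of the components
    EXu = (U.T @ X) / n               # row j holds the empirical E{x~ u_j}
    V_hat = np.linalg.solve(C, EXu)   # estimated spectral responses v^_j, Eq. (3)
    G = V_hat @ V_hat.T               # interference matrix with entries v^_i . v^_j
    return np.linalg.solve(G, V_hat)  # regression vectors r_j, Eq. (4)

def cls_predict(R, x_new):
    """Predicted concentrations u^ = x~ . r_j for all k components."""
    return R @ x_new

# Tiny illustrative usage with random data (k = 2 components, p = 200 wavelengths).
rng = np.random.default_rng(0)
V_true = rng.standard_normal((2, 200))
U = rng.uniform(0.0, 1.0, (30, 2))
X = U @ V_true + 0.01 * rng.standard_normal((30, 200))
R = cls_fit(X, U)
print(cls_predict(R, X[0]), U[0])     # predicted vs true concentrations for one sample
```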

Before considering the case of finite n, we first state the following well-known result about CLS regression as the number of training samples approaches infinity.

Theorem 1
As n → ∞, the regression vector computed by CLS for the jth component is given by

r_j = \frac{v_j^\perp}{\|v_j^\perp\|^2}

where v_j^⊥ is the net analyte signal vector of the jth component [11]. The corresponding root mean squared error of prediction is given by

\mathrm{RMSEP}(\mathrm{CLS},\ n = \infty) = \frac{\sigma}{\|v_j^\perp\|}    (5)

A proof of this theorem appears in the Appendix. It shows that, as n → ∞, CLS computes all spectral responses {v_j} without error, and by Equation (4) also computes an error-free net analyte signal vector. The prediction error is therefore due only to the noise in the new unseen spectral data, and for an unbiased estimator CLS yields the optimal prediction possible under a mean squared error criterion.

When the regression vector is computed from a finite set of noisy samples, the prediction errors may be significantly larger than in Equation (5), since various estimates in the CLS algorithm become noisy. Intuitively, multivariate calibration is more difficult either when different components are highly correlated in the training set or when there are non-negligible interferences amongst the different spectral responses v_j. In order to quantify these effects, we define

s_j = \sqrt{\frac{1}{n} \sum_{i=1}^{n} u_{i,j}^2}    (6)

and denote by λ₀ the minimal eigenvalue of the covariance matrix C of the training set. In addition, we define V to be the k × k matrix of interferences of the noise-free spectral responses v₁, ..., v_k, with entries V_{i,j} = v_i · v_j, and denote by ν₀ its smallest eigenvalue.

The following theorem and its corollary, both proven in the Appendix, quantify the prediction error in CLS with a finite training set.

Theorem 2
Let r̂₁ denote the estimated regression vector computed by CLS with a finite number of training samples. Then

\hat{r}_1 = \frac{v_1^\perp}{\|v_1^\perp\|^2} + \frac{\sigma\, \bar{v}\, \bar{s}}{\sqrt{n}\, \lambda_0 \nu_0}\, f_1 + \frac{\sigma^2\, \bar{s}^2}{n\, \lambda_0 \nu_0}\, f_2 + O(\sigma^3)    (7)

where v̄ = max_j ||v_j||, s̄ = max_j s_j, f₁ is a random noise vector in R^p whose p co-ordinates all have zero mean and O(1) variance and f₂ is a vector whose p co-ordinates are all O(p). The expected mean squared error of prediction for u₁ admits the form

E\{\mathrm{MSEP}(\mathrm{CLS}, n)\} = E\{(\hat{u}_1 - u_1)^2\} = \frac{\sigma^2}{\|v_1^\perp\|^2} \left[ 1 + \frac{c_1}{n} + \frac{\sigma^2 p^2}{n^2}\,(c_2 + o(1)) \right]    (8)

where the constants c₁ and c₂ are complicated functions of the covariance matrices C and C^p and the spectral responses v_j but are independent of σ, n and p.

The p co-ordinates of f₁ are all linear combinations of the noises n_i in the training set, with the exact coefficients being complex functions of the covariance matrix C and the spectral responses v_j. The vector f₂, on the other hand, is a complex quadratic function of the original noises in the training set, such that all its p co-ordinates are O(p).

Example 1
In the case of a system with a single component u₁ (k = 1) the coefficients in (7) and (8) can be evaluated explicitly. Specifically, given a training set of n samples of the form x̃_i = u_i v + σ n_i, where for simplicity the subscript notation is dropped from v₁ and u₁, the estimate v̂ can be written as

\hat{v} = v + \frac{\sigma}{\sqrt{n}\, s}\, \bar{n}    (9)

where s is given by (6) and

\bar{n} = \frac{1}{\sqrt{n}\, s} \sum_{i=1}^{n} u_i n_i

is a normal random variable in R^p whose p co-ordinates all have zero mean and unit variance. The regression vector is r = v̂/||v̂||², leading to the following predicted value û for a new noisy sample x̃:

\hat{u} = u \left( \frac{\|v\|^2}{\|\hat{v}\|^2} + \frac{\sigma\, \bar{n} \cdot v}{\sqrt{n}\, s\, \|\hat{v}\|^2} \right) + \sigma\, \frac{n \cdot \hat{v}}{\|\hat{v}\|^2}

The corresponding expected mean squared error of prediction is

E\{\mathrm{MSEP}\} = \frac{\sigma^2}{\|v\|^2} \left[ 1 + \frac{1}{n} \left( 1 + \frac{E\{u^2\}}{s^2} \right) + \frac{\sigma^2 p^2}{n^2}\, \frac{E\{u^2\}}{s^2} \right] + O\!\left( \frac{\sigma^4 p}{s^2 n^2} + \frac{\sigma^4 p}{n\, \|v\|^2 s^2} \right)    (10)

Equation (7) shows that for the case of a finite training set there can be substantial differences between the noise-free net analyte signal and the one estimated by CLS. The following cor
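To complement Example 1, the following Monte Carlo sketch estimates the prediction error of single-component CLS empirically and compares it with the asymptotic value σ/||v|| of Equation (5). It assumes NumPy; the values of n, p, σ and the concentration distribution are illustrative choices, not taken from the paper. With p large relative to n, the finite-sample inflation of the error described by Equation (10) becomes clearly visible.

```python
# Monte Carlo sketch for the single-component case of Example 1: compare the
# empirical RMSEP of CLS with the asymptotic value sigma/||v|| of Eq. (5).
# Assumes NumPy; n, p, sigma and the number of trials are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
p, sigma, n_train, n_test, n_trials = 500, 0.05, 25, 200, 200

v = rng.standard_normal(p)
v /= np.linalg.norm(v)                               # single spectral response, ||v|| = 1

sq_errors = []
for _ in range(n_trials):
    # Training set: x~_i = u_i v + sigma n_i
    u_train = rng.uniform(0.5, 1.5, n_train)
    X_train = np.outer(u_train, v) + sigma * rng.standard_normal((n_train, p))
    v_hat = u_train @ X_train / np.sum(u_train**2)   # least squares estimate of v
    r = v_hat / np.dot(v_hat, v_hat)                 # regression vector r = v^/||v^||^2

    # Independent test set and its squared prediction errors
    u_test = rng.uniform(0.5, 1.5, n_test)
    X_test = np.outer(u_test, v) + sigma * rng.standard_normal((n_test, p))
    u_pred = X_test @ r
    sq_errors.append(np.mean((u_pred - u_test) ** 2))

print("empirical RMSEP :", np.sqrt(np.mean(sq_errors)))
print("asymptotic RMSEP:", sigma / np.linalg.norm(v))    # Eq. (5), n -> infinity
```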

