Multivariate Data Analysis For Omics - Metabolomics


Multivariate Data Analysis for Omics
September 2-3, 2008
Susanne Wiklund
IID 1062

Multivariate Data Analysis and Modelling in “Omics”

Outline

Day 1
Chapter 1
- Introduction to multivariate data analysis
- Introduction to “omics”
- Introduction to principal component analysis
Chapter 2
- Overview of data tables
- How PCA works
- PCA example
- PCA diagnostics
Chapter 3
- PCA for finding patterns, trends and outliers
- PCA example
Chapter 4
- Data processing
- Scaling
- Normalisation

Day 2
Chapter 5
- Introduction to orthogonal partial least squares (OPLS)
- From PCA to OPLS-DA
- Classification
- Biomarker identification
- Multiple treatments
Chapter 6
- Validation

Exercises
- Foods: PCA
- Rats Metabonomics 1: metabolomics, NMR data, PCA
- Health: clinical data, PCA using paired samples
- MSMouse: metabolomics, LC/MS data, PCA and OPLS-DA (task 2 not included), misclassification
- Genegrid I: microarray, PCA, OPLS-DA
- Ovarian cancer: proteomics, MS data, OPLS-DA, S-plot
- PCA vs. OPLS-DA: metabolomics, NMR data, PCA and OPLS-DA
- GC/MS metabolomics: resolved and integrated GC/MS data, OPLS-DA, S-plot and SUS-plot
- Rats Metabonomics 2: metabolomics, NMR data, OPLS-DA, S-plot, SUS-plot
- Identification of bias effects in transcriptomics data: microarray data, PCA, OPLS-DA
- Proteomics anti diabetics: proteomics, MS data
Underscore means that all participants should do these exercises.

Multivariate Analysis for “omics” data
Chapter 1
Introduction

General cases that will be discussed during this course
PCA
- Patterns
- Trends
- Outlier detection
OPLS-DA
- Classification
- Potential biomarkers
- Multiple treatments
[Figure: PCA score plot, t[1] vs t[2], model NMR METABOLOMICS PCA VS OPLSDA.M1 (PCA-X), colored according to classes in M1; R2X[1] = 0.333338, R2X[2] = 0.211739; ellipse: Hotelling T2 (0.95)]
[Figure: OPLS-DA score plot, t[1] vs to[2], model NMR METABOLOMICS PCA VS OPLSDA.M2 (OPLS/O2PLS-DA), WT vs MYB76, colored according to classes in M2; between-group variation along t[1], within-group variation along to[2]; R2X[1] = 0.156759, R2X[XSide Comp. 2] = 0.211296; ellipse: Hotelling T2 (0.95)]
[Figures: S-plot and SUS-plot with metabolite labels]
8/15/2008

Outline
Need for multivariate analysis
- Example
Measurements
- Univariate, bivariate, multivariate
Why multivariate methods
Introduction to multivariate methods
- Data tables and notation
- What is a projection?
- Concept of latent variable
- “Omics”
Introduction to principal component analysis

Background
Needs for multivariate data analysis
Most data sets today are multivariate
- due to (a) availability of instrumentation and (b) complexity of systems and processes
Continuing uni- and bivariate analysis is
- often misleading (example: will be described)
- often inefficient (example: t-test on 245 variables)

Multivariate Data Analysis
Extracting information from data with multiple variables by using all the variables simultaneously.
It's all about:
- How to get information out of existing multivariate data
It's much less about:
- How to structure the problem
- Which variables to measure
- Which observations to measure (DoE)

Introduction to “omics”
“omics” in the [...]: [...], toxicogenomics, and many more
The “omics” data in this course includes
- Metabolomics
- Proteomics
- Transcriptomics
What do they have in common?
- Last 5 letters
- Few samples
- Many variables
- Measurement of all detectable species represented, i.e. very complex data
- Classification and diagnostics
- Biomarkers
- Explore biology

Introduction to “omics”
Metabolomics
“comprehensive analysis of the whole metabolome under a given set of conditions” [1]
Metabonomics
“the quantitative measurement of the dynamic multiparametric metabolic response of living systems to pathophysiological stimuli or genetic modification” [2]
1. Fiehn, O., et al. Metabolite profiling for plant functional genomics. Nature Biotechnology. 2000;18:1157-1161.
2. Nicholson, J. K., et al. 'Metabonomics': understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data. Xenobiotica. 1999;29:1181-1189.

Objectives in “Omics”
Study organisms as integrated systems
- Genes
- Proteins
- Metabolic pathways
- Cellular events
Extract and distil information on
- Genes
- Disease
- Physiological state
- Diet
- Biological age
- Nutrition
Create new diagnostic tools
One major goal is to extract biomarkers and understand the interplay between molecular and cellular components

“Omics” workflow
Problem formulation
- Aim
- Goal
Experimental design
- Number of samples
- Gender
- Age
- etc.
Experiment
- Sample preparation
- Data collection
Data pre-processing
- Alignment
- Phasing
- Normalisation
- Integration/bucketing
- Peak picking
Data analysis
- PCA
- OPLS
- OPLS-DA
- O2PLS
- Hierarchical modelling

Today's Data
GC/MS, LC/MS, NMR spectrum or genechip
- c. 10,000 peaks for human urine
Problems
- Many variables
- Few observations
- Noisy data
- Missing data
- Multiple responses
Implications
- High degree of correlation
- Difficult to analyse with conventional methods
Data is not the same as information
- Need ways to extract information from the data
- Need reliable, predictive information
- Ignore random variation (noise)
Multivariate analysis is the tool of choice

Causality vs Correlation
Perturbation of a biological system causes myriad changes, only some of which will be directly related to the cause
- Typically we find a population of changes with statistical methods
- May be irrelevant or even counter-directional
- Further biological evidence always required
[Figure: diagram linking environmental factors or genetic makeup, altered gene expression, protein synthesis and metabolites to disease, with reinforcing effects, bystander effects, critical effects, compensatory effects and effects with no relation to disease]

Correlation and Causality
Correlation or causation?
Although the two variables are correlated, this does not imply that one causes the other!
Real but non-causal, or ...
[Figure: inhabitants (in thousands) plotted against number of storks in Oldenburg, 1930-1936]

Data with many Variables
Multivariate
- More than 6 variables
N Observations
- Humans, rats, plants
- Trials, time points
K Variables
- Spectra, peak tables
Most systems are characterised by 2-6 underlying processes, yet we measure thousands of things
[Figure: data table with N rows (observations) and K columns (variables)]

Observations and spectroscopic variables
Each sample spectrum is one observation
Each data point in the spectrum will represent one variable
Variables can also be resolved and integrated; in that case each integral will create a variable
[Figures: NMR data from one observation, plotted against Var ID (No), as a full spectrum and as resolved integrals]

Types of Data in “omics”

Field | Observations (N) | Variables (K)
Metabolomics | Biofluids, plant extracts, tissue samples | Spectra from: 1H NMR, 13C NMR, 1H-13C NMR, GC/MS, LC/MS, UPLC/MS
Proteomics | Tissue samples | 2D gels, electrophoresis/MS
Genomics/transcriptomics | Tissue samples | Microarrays, fluorescence probes
Chromatography | Columns, solvents, additives, mixtures | Physical properties, retention times

Poor Methods of Data Analysis
Plot pairs of variables
- Tedious, impractical
- Risk of spurious correlations
- Risk of missing information
Select a few variables and use MLR
- Throwing away information
- Assumes no 'noise' in X
- One Y at a time
[Figure: X1, X2, X3 and Y1, Y2, Y3]

Development of Classical Statistics - 1930s
Methods
- Multiple regression
- Canonical correlation
- Linear discriminant analysis
- Analysis of variance
Assumptions:
- Independent X variables
- Precise X variables, error in Y only
- Many more observations than variables
- Regression analysis one Y at a time
- No missing data
Tables are long and lean
[Figure: data table with N rows and K columns, N much larger than K]

Risks with Classical Methods
Comparing two groups (t-test)
Typically 5% significance level used
- Type I errors: false positives, spurious results
- Type II errors: false negatives, risk of not seeing the information
[Figure: risk of spurious result (Type I risk) plotted against number of variables]
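The risk of spurious results from running one t-test per variable can be made concrete with a short calculation. This is a minimal sketch, assuming the tests are independent and each is run at the 5% significance level mentioned above; under that assumption the chance of at least one false positive among K tests is 1 - 0.95^K. The function name `type_i_risk` is ours, not from the course material.

```python
def type_i_risk(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n_tests
    independent tests, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


# The 245-variable case is the one mentioned earlier in the slides.
for k in (1, 5, 10, 60, 245):
    print(f"{k:>4} variables: risk of a spurious result = {type_i_risk(k):.3f}")
```

Real omics variables are correlated rather than independent, so the exact numbers differ in practice, but the trend is the same: with hundreds of variables, some "significant" differences are expected by chance alone.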

Research in 21st Century
Experimental costs, ethics, regulations lead to few observations
Instrumental & electronics revolution leads to many variables
Chemometrics: short & wide data tables
[Figure: data table with N rows and K columns, K much larger than N]

A Better Way
Multivariate analysis by projection
- Looks at ALL the variables together
- Avoids loss of information
- Finds underlying trends, “latent variables”
- More stable models

Why MVDA by Projections (PCA & OPLS)?
Deals with the dimensionality problem
Handles all types of data tables
- Short and wide, N << K
- Almost square, N ≈ K
- Long and lean, N >> K
Separates regularities from noise
- Models X and models Y
- Models relation between X and Y
- Expresses the noise
Extracts information from all data simultaneously
- Data are not the same as information
Handles correlation
Results are displayed graphically
Copes with missing data
Robust to noise in both X and Y

What is a Projection?
Reduction of dimensionality, model in latent variables!
Algebraically
- Summarizes the information in the observations as a few new (latent) variables
Geometrically
- The swarm of points in a K-dimensional space (K = number of variables) is approximated by a (hyper)plane and the points are projected on that plane.
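The geometric description above can be sketched in a few lines of NumPy. This is an illustration only, assuming a least-squares plane found with a singular value decomposition (SVD); the data are synthetic and the function name `project_onto_plane` is ours.

```python
import numpy as np


def project_onto_plane(X: np.ndarray, n_dims: int = 2):
    """Project the rows of X (N observations x K variables) onto the
    best-fitting n_dims-dimensional (hyper)plane through the data mean."""
    mean = X.mean(axis=0)
    Xc = X - mean                          # move the origin to the centre of the swarm
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    axes = Vt[:n_dims].T                   # K x n_dims orthonormal basis of the plane
    coords = Xc @ axes                     # position of each point within the plane
    X_hat = coords @ axes.T + mean         # projected points, in the original space
    return coords, X_hat


rng = np.random.default_rng(0)
# Synthetic example: 50 points in 3-D that lie close to a 2-D plane.
latent = rng.normal(size=(50, 2))
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 1.0]])
X = latent @ mixing + 0.01 * rng.normal(size=(50, 3))

coords, X_hat = project_onto_plane(X)
print("coordinates in the plane:", coords.shape)
print("largest distance lost by projecting:", float(np.abs(X - X_hat).max()))
```

The two columns of `coords` play the role of the new (latent) variables: each observation is described by two numbers instead of three, and the part left out is small because the points were close to a plane to begin with.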

What is a Projection?
Variables form axes in a multidimensional space
An observation in multidimensional space is a point
Project points onto a plane

Fundamental Data Analysis Objectives
Overview (PCA)
- Trends
- Outliers
- Quality control
- Biological diversity
- Patient monitoring
Classification (SIMCA)
- Pattern recognition
- Diagnostics
- Healthy/diseased
- Toxicity mechanisms
- Disease progression
Discrimination
- Discriminating between groups
- Biomarker candidates
- Comparing studies [...]
[...] blocks of omics data (O2-PLS)
- Metab vs proteomic vs genomic
- Correlation spectroscopy (STOCSY)

Summary
Data 2008
- Short wide data tables
- Highly correlated variables measuring similar things
- Noise, missing data
Poor methods of analysis
- One variable at a time
- Selection of variables (throwing away data)
Fundamental objectives
- Overview & summary
- Classification & discrimination
- Relationships
Multivariate methods use redundancy in data to:
- Reduce dimensionality
- Improve stability
- Separate signal from noise

Principal Components Analysis (PCA)
The foundation of all latent variable projection methods

Correlation between Variables
[Figure: Variable 1 and Variable 2 each plotted over time with ±3 SD limits, and Variable 1 plotted against Variable 2, where 2 outliers stand out]
The information is found in the correlation pattern - not in the individual variables!

Principal Components Analysis
Data visualisation and simplification
- Information resides in the correlation structure of the data
- Mathematical principle of projection to lower dimensionality

2 Variables
V1 | V2
1 | 1.3
2 | 2.3
3 | 2.7
4 | 3.9

Many Variables
V1 | V2 | V3 | ... | Vn
1 | 1.3 | 0.4
2 | 2.3 | 1.2
3 | 2.7 | 2.1
4 | 3.9 | 4.6

[Figure: the many-variable table projected from 3D to 2D, with axes PC1 and PC2]
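The point that information sits in the correlation pattern rather than in the individual variables can be illustrated with synthetic data. In this sketch, one added observation stays inside ±3 SD for each variable on its own, yet it clearly breaks the correlation between them. The Mahalanobis distance used for the joint view is our choice of a standard multivariate distance (it is related to the Hotelling T2 ellipse shown in the score plots); it is not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: 40 observations of two strongly correlated variables.
v1 = rng.normal(size=40)
v2 = v1 + 0.1 * rng.normal(size=40)
X = np.column_stack([v1, v2])
# One added point: unremarkable in each variable, but against the correlation.
X = np.vstack([X, [1.5, -1.5]])

# Univariate view: z-score of every value within its own variable.
z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Joint view: Mahalanobis distance of every observation from the centre.
d = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
mahalanobis = np.sqrt(np.einsum("ij,jk,ik->i", d, cov_inv, d))

print("largest |z| of the added point:", float(np.abs(z[-1]).max()))
print("Mahalanobis distance of the added point:", float(mahalanobis[-1]))
print("most extreme observation in the joint view:", int(mahalanobis.argmax()))
```

Looking at one variable at a time, the added point would pass a ±3 SD check; looking at both variables together, it is the most extreme observation in the table.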

PCA Simplifies Data
PCA breaks down a large table of data into two smaller ones
Plots of scores and loadings turn data into pictures
Correlations among observations and variables are easily seen
[Figure: data table with many variables split into scores and loadings]
SCORES
- Summarise the observations
- Separate signal from noise
- Observe patterns, trends, clusters
LOADINGS
- Summarise the variables
- Explain the position of observations in the scores plot

PCA Converts Tables to Pictures
PCA converts a table into two interpretable plots:
- Scores plot relates to observations
- Loadings plot relates to variables
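The split of one large table into two smaller ones can be sketched with NumPy. This is a minimal illustration, assuming mean-centred, unit-variance-scaled data and an SVD-based calculation (the course software may compute PCA differently). The table is synthetic, and the names `pca_scores_loadings`, `T`, `P`, `E` and `r2x` are ours, chosen to match the score matrix T, loading matrix P and R2X values that appear elsewhere in the slides.

```python
import numpy as np


def pca_scores_loadings(X: np.ndarray, n_components: int):
    """Return scores T (N x A), loadings P (K x A), residuals E (N x K)
    and the fraction of variation explained by each component."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # centre and scale each variable
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]          # scores: one row per observation
    P = Vt[:n_components].T                             # loadings: one row per variable
    E = Xs - T @ P.T                                    # what the model leaves unexplained
    r2x = s[:n_components] ** 2 / np.sum(s ** 2)
    return T, P, E, r2x


rng = np.random.default_rng(1)
# Synthetic table: 16 observations x 20 variables, driven by 2 hidden factors plus noise.
X = rng.normal(size=(16, 2)) @ rng.normal(size=(2, 20)) + 0.1 * rng.normal(size=(16, 20))

T, P, E, r2x = pca_scores_loadings(X, n_components=2)
print("scores:", T.shape, "loadings:", P.shape)
print("fraction of variation per component:", r2x.round(3))
```

Plotting the two columns of `T` against each other gives a scores plot (observations); plotting the two columns of `P` gives the matching loadings plot (variables).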

PCA Example
Problem: To investigate patterns of food consumption in Western Europe, the percentage usage of 20 common food products was obtained for 16 countries.
Perform a multivariate analysis (PCA) to overview data.
[Table: food consumption patterns for 16 European countries (part of the data). Columns: Country, Grain coffee, Instant coffee, Tea, Sweetener, Biscuits, Pa soup, Ti soup, In potat, Fro fish, Fro veg, Fresh apple, Fresh orange, Ti [...], Jam, Garlic, Butter, Marg. Rows shown: England, Portugal, Austria, Switzerland. Values not recoverable.]

General PCA Example - Foods
[Figures: scores plot (observations) and loadings plot (variables)]

PCA to overview 1
Example: toxicity study of rats
Two different types of rats and two different types of drugs
- Aim: identify trends and biomarkers for toxicity
PCA useful to identify outliers, biological diversity and toxicity trends
[Figure: PCA score plot, t[1] vs t[2], model Metabonomics coded.M17 (PCA-X), colored according to Obs ID (ClassID); R2X[1] = 0.249915, R2X[2] = 0.226868; ellipse: Hotelling T2 (0.95)]

PCA for Overview 2
Example: HR/MAS 1H NMR study from poplar plants
- Aim: biomarkers to explore biology
Scores plot shows poplar samples from two different types, one wild type and the other transgenic
Interpretation of PCA scores shows patterns and trends
[Figure: PCA score plot, t[1] vs t[2], model NMR METABOLOMICS PCA VS OPLSDA.M1 (PCA-X), colored according to classes in M1; R2X[1] = 0.333338, R2X[2] = 0.211739; ellipse: Hotelling T2 (0.95)]

PCA for Overview 3
Genetic study of mice
- Black, White, Nude
- Mass Lynx data
PCA useful for QC of biological results:
- Biological diversity
- Outlier detection
- Finding trends
Data courtesy of Ian Wilson and Waters Corporation Inc

Summary
Data is not information
Information lies in correlation structure
Projection methods explain correlation structure among all variables
PCA provides a graphical overview, a natural starting point for any multivariate data analysis
PCA gives
- Scores: summary of observations
- Loadings: summary of variables

Multivariate Analysis for “omics” data
Chapter 2
Overview of Data Tables: Principal Components Analysis (PCA)

Contents
- Notations
- Scaling
- Geometric interpretation
- Algebraic interpretation
- Example
- PCA diagnostics

Notation
N Observations
- Humans
- Plants
- Other individuals
- Trials
- Etc.
K Variables
- Spectra
- Peak tables
- Etc.
[Figure: data table with N rows and K columns]

Notation
N = number of observations
K = number of variables
A = number of principal components
ws = scaling weights
t1, t2, ..., tA = scores (forming matrix T)
p1, p2, ..., pA = loadings (forming matrix P)

Key Concepts with Multivariate Methods
1. Data must be scaled or transformed appropriately
2. Data may need to be 'cleaned'
   - Outliers are interesting, but they can upset a model
   - Must detect, investigate and possibly remove them
3. Need to determine how well the model fits the data
4. Fit does not give predictive ability!
   - Model information not noise, avoid overfit
   - Need to estimate predictive ability

Data Pre-Processing - Transformations
If the data are not approximately normally distributed, a suitable transformation might be required to improve the modelling results
Before transformation
- Skew distribution
After log-transformation
- Closer to a normal distribution
[Figures: histograms of a variable from the cuprum data set (DS1.kNi), before and after log transform]
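The effect of a log transformation on a skewed variable can be checked numerically. This sketch uses only the Python standard library and synthetic log-normal data standing in for a skewed measurement (it is not the cuprum data from the slide); the helper `skewness` is ours.

```python
import math
import random


def skewness(values):
    """Sample skewness (third standardised moment); 0 for a symmetric distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


random.seed(0)
# Synthetic, right-skewed data (log-normal), standing in for a skewed measured variable.
raw = [random.lognormvariate(0.0, 1.0) for _ in range(2000)]
logged = [math.log10(v) for v in raw]

print(f"skewness before transformation: {skewness(raw):.2f}")
print(f"skewness after log10 transformation: {skewness(logged):.2f}")
```

A skewness near zero after the transformation is a quick indication that the variable is closer to symmetric; a histogram, as on the slide, remains the more informative check.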

Scaling Example - Height vs Weight
Data for 23 individuals (22 players + referee in a football match)

Height (m): 1.8, 1.61, 1.68, 1.75, 1.74, 1.67, 1.72, 1.98, 1.92, 1.7, 1.77, 1.92
Weight (kg): 86, 74, 73, 84, 79, 78, 80, 96, 90, 80, 86, 93

Height (m): 1.6, 1.85, 1.87, 1.94, 1.89, 1.89, 1.86, 1.78, 1.75, 1.8, 1.68
Weight (kg): 75, 84, 85, 96, 94, 86, 88, 99, 80, 82, 76

[Figures: body weight (kg) plotted against body height (m), once with the same scale on both axes and once with the same spread on both axes]
Left: scaled
Right: unscaled, outlier is not so easy to spot!

Data Pre-Processing - Scaling
Problem: Variables can have substantially different ranges
Different ranges can cause problems for modelling and interpretation
Defining the length of each variable axis, i.e. the SD
Default in SIMCA: To set variation along each axis to one (unit variance)
[Figure: axes x1, x2, x3 before and after scaling]

Unit Variance Scaling (UV)
PCA is scale dependent
- Is the size of a variable important?
Scaling weight is 1/SD for each variable, i.e. divide each variable by its standard deviation
Variance of scaled variables = 1
[Figure: X multiplied by the scaling weights ws = 1/SD gives the UV-scaled table]

Summary
Variables may need to be transformed prior to analysis to make them more normally distributed
Results are scale dependent - which scaling is appropriate?
- (will come back to this in Chapter 4)
Default is UV scaling - all variables given equal weight
Not usually recommended with spectroscopic data, where no scaling is the norm
Compromise is Pareto scaling, which is commonly used in metabonomic studies (Chapter 4)
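The scaling options named above can be compared on a small example using only the Python standard library. Two assumptions are made here that the slide does not state: each variable is mean-centred before scaling, and Pareto scaling is taken in its usual form, dividing by the square root of the standard deviation. The function `scale_column` and the height and weight values are hypothetical, made up for illustration.

```python
import math
import statistics


def scale_column(values, method="uv"):
    """Mean-centre one variable and scale it.
    'uv': divide by SD (unit variance); 'pareto': divide by sqrt(SD);
    'ctr': centre only, no scaling."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)            # sample standard deviation (n - 1)
    if method == "uv":
        weight = 1.0 / sd
    elif method == "pareto":
        weight = 1.0 / math.sqrt(sd)         # assumed definition, see lead-in
    elif method == "ctr":
        weight = 1.0
    else:
        raise ValueError(f"unknown scaling method: {method}")
    return [(v - mean) * weight for v in values]


# Hypothetical values for illustration: height in m, weight in kg.
height = [1.60, 1.70, 1.75, 1.80, 1.85, 1.95]
weight = [60.0, 68.0, 75.0, 80.0, 88.0, 97.0]

for name, column in (("height", height), ("weight", weight)):
    sd_raw = statistics.stdev(column)
    sd_uv = statistics.stdev(scale_column(column, "uv"))
    sd_par = statistics.stdev(scale_column(column, "pareto"))
    print(f"{name}: SD raw = {sd_raw:.3f}, after UV = {sd_uv:.3f}, after Pareto = {sd_par:.3f}")
```

After UV scaling both variables have the same spread, so neither dominates the model because of its units. Pareto scaling only partly removes the difference, which is why it is described as a compromise between UV scaling and no scaling.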

Multivariate Analysis for “omics” data
How PCA Works

PCA - Geometric Interpretation
[Figure: axes x1, x2, x3]
We construct a space with K dimensions - 3 sho
