Modern Statistics For Modern Biology

Cambridge University Press
ISBN 978-1-108-70529-5
Modern Statistics for Modern Biology
Susan Holmes, Wolfgang Huber
Frontmatter

Modern Statistics for Modern Biology

If you are a biologist and want to get the best out of the powerful methods of modern computational statistics, this is your book. You can visualize and analyze your own data, apply unsupervised and supervised learning, integrate datasets, apply hypothesis testing, and make publication-quality figures using the power of R/Bioconductor and ggplot2.

This book will teach you “cooking from scratch,” from raw data to beautiful illuminating output, as you learn to write your own scripts in the R language and to use advanced statistics packages from CRAN and Bioconductor. It covers a broad range of basic and advanced topics important in the analysis of high-throughput biological data, including principal component analysis and multidimensional scaling, clustering, multiple testing, unsupervised and supervised learning, resampling, the pitfalls of experimental design, and power simulations using Monte Carlo, and it even reaches networks, trees, spatial statistics, image data, and microbial ecology. Using a minimum of mathematical notation, it builds understanding from well-chosen examples, simulation, visualization, and above all hands-on interaction with data and code.

R package msmb contains complete code and the example datasets, allowing students to recreate all examples, figures, and results in the book.

Solutions, slides, and dynamic material available on the course website.

Introduces methods on a “need to know” basis, so students tackle biological questions immediately and understand motivation for the methods.

Real-life examples done from scratch, guiding students through realistic complexities and building practical intuition.

Includes a wrap-up chapter that explains the complete workflow from design of experiments to analysis of results, identifying common pitfalls with big data.

All figures and results generated by the code in the book, demonstrating how reproducible research works.

SUSAN HOLMES is Professor of Statistics at Stanford University, California. She specializes in exploring and visualizing multidomain biological data, using computational statistics to draw inferences in microbiology, immunology and cancer biology. She has published over 100 research papers, and has been a key developer of software for the multivariate analyses of complex heterogeneous data. She was the Breiman Lecturer at NIPS 2016, has been named a Fields Institute fellow, and is currently a fellow at the Center for Advanced Study in the Behavioral Sciences.

WOLFGANG HUBER is Research Group Leader and Senior Scientist at the European Molecular Biology Laboratory, where he develops computational methods for new biotechnologies and applies them to biological discovery. He has published over 150 research papers in functional genomics, cancer and statistical methods. He is a founding member of the open-source bioinformatics software collaboration Bioconductor and has co-authored two books on Bioconductor.

Modern Statistics for Modern Biology
Susan Holmes, Stanford University, California
Wolfgang Huber, European Molecular Biology Laboratory

University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge. It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781108705295
DOI: 10.1017/9781108551441

© Susan Holmes and Wolfgang Huber 2018

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2018
Printed and bound in Great Britain by Clays Ltd, Elcograf S.p.A.

A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
ISBN 978-1-108-70529-5 Paperback

Additional resources for this publication at www.cambridge.org/msmb

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

Image credits for chapter openers: Chapter 1, Wikicommons; Chapter 4, xkcd.com/1347; Chapter 5, mikedabell/iStock/Getty Images; Chapter 6, extract from xkcd.com/882/; Chapter 7, The Matrix: scene 291 Close on Computer Screen Warner Bros.; Chapter 8, xkcd.com/1725; Chapter 9, Robert Orchard/Moment/Getty Images; Chapter 13, University of Adelaide Library: Rare Books and Special Collections, R.A. Fisher Digital Archive, http://hdl.handle.net/2440/81670.

For Sonia, Sara, Agnès, Johnny, Camille . . . and the “girls” who make me love the life sciences

For Alexander

Contents

Introduction
1 Generative Models for Discrete Data
2 Statistical Modeling
3 High-Quality Graphics in R
4 Mixture Models
5 Clustering
6 Testing
7 Multivariate Analysis
8 High-Throughput Count Data
9 Multivariate Methods for Heterogeneous Data
10 Networks and Trees
11 Image Data
12 Supervised Learning
13 Design of High-Throughput Experiments and Their Analyses
Acknowledgements
Bibliography
Statistical Concordance
Index

Expanded Contents

Introduction
What is happening in biological data analysis?
The challenge: heterogeneity
What’s in this book?
Computational tools for modern biologists
Why R and Bioconductor?
How to read this book

1 Generative Models for Discrete Data
1.1 Goals for this chapter
1.2 A real example
1.3 Using discrete probability models
1.3.1 Bernoulli trials
1.3.2 Binomial success counts
1.3.3 Poisson distributions
1.3.4 A generative model for epitope detection
1.4 Multinomial distributions: the case of DNA
1.4.1 Simulating for power
1.5 Summary of this chapter
1.6 Further reading
1.7 Exercises

2 Statistical Modeling
2.1 Goals for this chapter
2.2 The difference between statistical and probabilistic models
2.3 A simple example of statistical modeling
2.3.1 Classical statistics for classical data
2.4 Binomial distributions and maximum likelihood
2.4.1 An example
2.5 More boxes: multinomial data
2.5.1 DNA count modeling: base pairs
2.5.2 Nucleotide bias
2.6 The χ² distribution
2.6.1 Intermezzo: quantiles and the quantile–quantile plot

2.7 Chargaff’s Rule
2.7.1 Two categorical variables
2.7.2 A special multinomial: Hardy–Weinberg equilibrium
2.7.3 Concatenating several multinomials: sequence motifs and logos
2.8 Modeling sequential dependencies: Markov chains
2.9 Bayesian thinking
2.9.1 Example: haplotype frequencies
2.9.2 Simulation study of the Bayesian paradigm for the binomial
2.10 Example: occurrence of a nucleotide pattern in a genome
2.10.1 Modeling in the case of dependencies
2.11 Summary of this chapter
2.12 Further reading
2.13 Exercises

3 High-Quality Graphics in R
3.1 Goals for this chapter
3.2 Base R plotting
3.3 An example dataset
3.4 ggplot2
3.4.1 Data flow
3.4.2 Saving figures
3.5 The grammar of graphics
3.6 Visualizing data in 1D
3.6.1 Barplots
3.6.2 Boxplots
3.6.3 Violin plots
3.6.4 Dot plots and beeswarm plots
3.6.5 Density plots
3.6.6 ECDF plots
3.6.7 The effect of transformations on densities
3.7 Visualizing data in 2D: scatterplots
3.7.1 Plot shapes
3.8 Visualizing more than two dimensions
3.8.1 Faceting
3.8.2 Interactive graphics
3.9 Color
3.10 Heatmaps
3.10.1 Dendrogram ordering
3.10.2 Color spaces
3.11 Data transformations
3.12 Mathematical symbols and other fonts
3.13 Genomic data
3.14 Summary of this chapter
3.15 Further reading
3.16 Exercises

4 Mixture Models
4.1 Goals for this chapter
4.2 Finite mixtures
4.2.1 Simple examples and computer experiments
4.2.2 Discovering the hidden group labels
4.2.3 Models for zero-inflated data
4.2.4 More than two components
4.3 Empirical distributions and the nonparametric bootstrap
4.4 Infinite mixtures
4.4.1 Infinite mixture of normals
4.4.2 Infinite mixtures of Poisson variables
4.4.3 Gamma distribution: two parameters (shape and scale)
4.4.4 Variance-stabilizing transformations
4.5 Summary of this chapter
4.6 Further reading
4.7 Exercises

5 Clustering
5.1 Goals for this chapter
5.2 What are the data and why do we cluster them?
5.2.1 Clustering can sometimes lead to discoveries
5.3 How do we measure similarity?
5.3.1 Computations related to distances in R
5.4 Nonparametric mixture detection
5.4.1 k-methods: k-means, k-medoids and PAM
5.4.2 Tight clusters with resampling
5.5 Clustering examples: flow cytometry and mass cytometry
5.5.1 Flow cytometry and mass cytometry
5.5.2 Data preprocessing
5.5.3 Density-based clustering
5.6 Hierarchical clustering
5.6.1 How to compute (dis)similarities between aggregated clusters?
5.7 Validating and choosing the number of clusters
5.7.1 Using the gap statistic
5.7.2 Cluster validation using the bootstrap
5.8 Clustering as a means for denoising
5.8.1 Noisy observations with different baseline frequencies
5.8.2 Denoising 16S rRNA sequences
5.8.3 Inferring sequence variants
5.9 Summary of this chapter
5.10 Further reading
5.11 Exercises

6 Testing
6.1 Goals for this chapter
6.1.1 Drinking from the firehose
6.1.2 Testing versus classification
6.2 Example: coin tossing
6.3 The five steps of hypothesis testing
6.3.1 The rejection region
6.4 Types of error
6.5 The t-test
6.5.1 Permutation tests
6.6 P-value hacking
6.7 Multiple testing
6.8 The family-wise error rate
6.8.1 Bonferroni correction
6.9 The false discovery rate
6.9.1 The p-value histogram
6.9.2 The Benjamini–Hochberg algorithm for controlling the FDR
6.10 The local FDR
6.10.1 Local versus total
6.11 Independent filtering and hypothesis weighting
6.12 Summary of this chapter
6.13 Further reading
6.14 Exercises

7 Multivariate Analysis
7.1 Goals for this chapter
7.2 What are the data? Matrices and their motivation
7.2.1 Low-dimensional data summaries and preparation
7.2.2 Preprocessing the data
7.3 Dimension reduction
7.3.1 Lower-dimensional projections
7.3.2 How do we summarize two-dimensional data by a line?
7.4 The new linear combinations
7.4.1 Optimal lines
7.5 The PCA workflow
7.6 The inner workings of PCA: rank reduction
7.6.1 Rank-one matrices
7.6.2 How do we find such a decomposition in a unique way?
7.6.3 Singular value decomposition
7.6.4 Principal components
7.7 Plotting the observations in the principal plane
7.7.1 PCA of the turtles data
7.7.2 A complete analysis: the decathlon athletes
7.7.3 How to choose k, the number of dimensions?

7.8 PCA as an exploratory tool: using extra information
7.8.1 Mass spectroscopy data analysis
7.8.2 Biplots and scaling
7.8.3 An example of weighted PCA
7.9 Summary of this chapter
7.10 Further reading
7.11 Exercises

8 High-Throughput Count Data
8.1 Goals of this chapter
8.2 Some core concepts
8.3 Count data
8.3.1 The challenges of count data
8.3.2 RNA-Seq: what about gene structures, splicing, isoforms?
8.4 Modeling count data
8.4.1 Dispersion
8.4.2 Normalization
8.5 A basic analysis
8.5.1 Example dataset: the pasilla data
8.5.2 The DESeq2 method
8.5.3 Exploring the results
8.5.4 Exporting the results
8.6 Critique of default choices and possible modifications
8.6.1 The few-changes assumption
8.6.2 Point-like null hypothesis
8.7 Multifactorial designs and linear models
8.7.1 What is a multifactorial design?
8.7.2 What about noise and replicates?
8.7.3 Analysis of variance
8.7.4 Robustness
8.8 Generalized linear models
8.8.1 Modeling the data on a transformed scale
8.8.2 Other error distributions
8.8.3 A generalized linear model for count data
8.9 Two-factor analysis of the pasilla data
8.10 Further statistical concepts
8.10.1 Sharing of dispersion information across genes
8.10.2 Count data transformations
8.10.3 Dealing with outliers
8.10.4 Tests of log2-fold change above or below a threshold
8.11 Summary of this chapter
8.12 Further reading
8.13 Exercises

9 Multivariate Methods for Heterogeneous Data
9.1 Goals for this chapter
9.2 Multidimensional scaling and ordination
9.2.1 How does the method work?
9.2.2 Robust versions of MDS
9.3 Contiguous or supplementary information
9.3.1 Known batches in data
9.3.2 Removing batch effects
9.3.3 Hybrid data and Bioconductor containers
9.4 Correspondence analysis for contingency tables
9.4.1 Cross-tabulation and contingency tables
9.4.2 Hair color, eye color and phenotype co-occurrence
9.5 Finding time . . . and other important gradients
9.5.1 Dynamics of cell development
9.5.2 Local nonlinear methods
9.6 Multitable techniques
9.6.1 Covariation, inertia, co-inertia and the RV coefficient
9.6.2 Mantel coefficient and a test of distance correlation
9.6.3 Canonical correlation analysis (CCA)
9.6.4 Sparse canonical correlation analysis (sCCA)
9.6.5 Canonical (or constrained) correspondence analysis (CCpnA)
9.7 Summary of this chapter
9.8 Further reading
9.9 Exercises

10 Networks and Trees
10.1 Goals for this chapter
10.2 Graphs
10.2.1 What is a graph and how can it be encoded?
10.2.2 Graphs with many layers: labels on edges and nodes
10.3 From gene set enrichment to networks
10.3.1 Methods using predefined gene sets (GSEA)
10.3.2 Gene set analysis with two-way table tests
10.3.3 Significant subgraphs and high-scoring modules
10.3.4 An example with the BioNet implementation
10.4 Phylogenetic trees
10.4.1 Markovian models for evolution
10.4.2 Simulating data and plotting a tree
10.4.3 Estimating a phylogenetic tree
10.4.4 Application to 16S rRNA data
10.5 Combining a phylogenetic tree into a data analysis
10.5.1 Hierarchical multiple testing

10.6 Minimum spanning trees
10.6.1 MST-based testing: the Friedman–Rafsky test
10.6.2 Example: bacteria sharing between mice
10.6.3 Friedman–Rafsky test with nested covariates
10.7 Summary of this chapter
10.8 Further reading
10.9 Exercises

11 Image Data
11.1 Goals for this chapter
11.2 Loading images
11.3 Displaying images
11.4 How are images stored in R?
11.5 Writing images to file
11.6 Manipulating images
11.7 Spatial transformations
11.8 Linear filters
11.8.1 Interlude: the intensity scale of images
11.8.2 Noise reduction by smoothing
11.9 Adaptive thresholding
11.10 Morphological operations on binary images
11.11 Segmentation of a binary image into objects
11.12 Voronoi tessellation
11.13 Segmenting the cell bodies
11.14 Feature extraction
11.15 Spatial statistics: point processes
11.15.1 Case study: interaction between immune cells and cancer cells
11.15.2 Convex hull
11.15.3 Other ways of defining the space for the point process
11.16 First-order effects: the intensity
11.16.1 Poisson process
11.16.2 Estimating the intensity
11.17 Second-order effects: spatial dependence
11.17.1 Ripley’s K function
11.18 Summary of this chapter
11.19 Further reading
11.20 Exercises

12 Supervised Learning
12.1 Goals for this chapter
12.2 What are the data?
12.2.1 Motivating examples
12.3 Linear discrimination
12.3.1 Diabetes data
12.3.2 Predicting embryonic cell state from gene expression

12.4 Machine learning versus rote learning
12.4.1 Cross-validation
12.4.2 The curse of dimensionality
12.5 Objective functions
12.6 Variance–bias trade-off
12.6.1 Penalization
12.6.2 Example: predicting colon cancer from stool microbiome composition
12.6.3 Example: classifying mouse cells from their expression profiles
12.7 A large choice of methods
12.7.1 Method hacking
12.8 Summary of this chapter
12.9 Further reading
12.10 Exercises

13 Design of High-Throughput Experiments and Their Analyses
13.1 Goals for this chapter
13.2 Types of experiments
13.3 Partitioning error: bias and noise
13.3.1 Error models: noise is in the eye of the beholder
13.3.2 Biological versus technical replicates
13.3.3 Units versus fold changes
13.3.4 Regular and catastrophic noise
13.4 Basic principles in the design of experiments
13.4.1 Confounding
13.4.2 Effect size and replicates
13.4.3 Clever combinations: Hotelling’s weighting example
13.4.4 Blocking and pairing
13.4.5 How many replicates do I need?
13.5 Mean–variance relationships and variance-stabilizing transformations
13.6 Data quality assessment and quality control
13.7 Longitudinal data
13.8 Data integration: use everything you (could) know
13.9 Sharpen your tools: reproducible research
13.10
