Statistical Analysis Handbook - StatsRef


Statistical Analysis Handbook
A Comprehensive Handbook of Statistical Concepts, Techniques and Software Tools
2018 Edition
Dr Michael J de Smith


Copyright 2015-2018 All Rights reserved. 2018 Edition. Issue version: 2018-1

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the UK Copyright Designs and Patents Act 1998 or with the written permission of the authors. The moral right of the authors has been asserted. Copies of this edition are available in electronic book and web-accessible formats only.

Disclaimer: This publication is designed to offer accurate and authoritative information in regard to the subject matter. It is provided on the understanding that it is not supplied as a form of professional or advisory service. References to software products, datasets or publications are purely made for information purposes and the inclusion or exclusion of any such item does not imply recommendation or otherwise of the product or material in question.

For more details please refer to the Guide’s website: www.statsref.com

ISBN-13:
978-1-912556-06-9 Hardback
978-1-912556-07-6 Paperback
978-1-912556-08-3 eBook

Published by: The Winchelsea Press, Drumlin Security Ltd, Edinburgh

Front inside cover image: Polar bubble plot (Matplotlib, Python)
Rear inside cover image: Florence Nightingale's polar diagram of causes of mortality, by month (source: Wikipedia)
Cover image: Mandelbrot set fractal

Table of Contents

1 Introduction
  1.1 How to use this Handbook
  1.2 Intended audience and scope
  1.3 Suggested reading
  1.4 Notation and symbology
  1.5 Historical context
  1.6 An applications-led discipline
2 Statistical data
  2.1 The Statistical Method
  2.2 Misuse, Misinterpretation and Bias
  2.3 Sampling and sample size
  2.4 Data preparation and cleaning
  2.5 Missing data and data errors
  2.6 Statistical error
  2.7 Statistics in Medical Research
    2.7.1 Causation
    2.7.2 Conduct and reporting of medical research
3 Statistical concepts
  3.1 Probability theory
    3.1.1 Odds
    3.1.2 Risks
    3.1.3 Frequentist probability theory
    3.1.4 Bayesian probability theory
    3.1.5 Probability distributions
  3.2 Statistical modeling
  3.3 Computational statistics
  3.4 Inference
  3.5 Bias
  3.6 Confounding
  3.7 Hypothesis testing
  3.8 Types of error
  3.9 Statistical significance
  3.10 Confidence intervals
  3.11 Power and robustness
  3.12 Degrees of freedom
  3.13 Non-parametric analysis
4 Descriptive statistics
  4.1 Counts and specific values
  4.2 Measures of central tendency
  4.3 Measures of spread
  4.4 Measures of distribution shape
  4.5 Statistical indices
  4.6 Moments
5 Key functions and expressions
  5.1 Key functions
  5.2 Measures of Complexity and Model selection
  5.3 Matrices
6 Data transformation and standardization
  6.1 Box-Cox and Power transforms
  6.2 Freeman-Tukey (square root and arcsine) transforms
  6.3 Log and Exponential transforms
  6.4 Logit transform
  6.5 Normal transform (z-transform)
7 Data exploration
  7.1 Graphics and visualization
  7.2 Exploratory Data Analysis
8 Randomness and Randomization
  8.1 Random numbers
  8.2 Random permutations
  8.3 Resampling
  8.4 Runs test
  8.5 Random walks
  8.6 Markov processes
  8.7 Monte Carlo methods
    8.7.1 Monte Carlo Integration
    8.7.2 Monte Carlo Markov Chains (MCMC)
9 Correlation and autocorrelation
  9.1 Pearson (Product moment) correlation
  9.2 Rank correlation
  9.3 Canonical correlation
  9.4 Autocorrelation
    9.4.1 Temporal autocorrelation
    9.4.2 Spatial autocorrelation
10 Probability distributions
  10.1 Discrete Distributions
    10.1.1 Binomial distribution
    10.1.2 Hypergeometric distribution
    10.1.3 Multinomial distribution
    10.1.4 Negative Binomial or Pascal and Geometric distribution
    10.1.5 Poisson distribution
    10.1.6 Skellam distribution
    10.1.7 Zipf or Zeta distribution
  10.2 Continuous univariate distributions
    10.2.1 Beta distribution
    10.2.2 Chi-Square distribution
    10.2.3 Cauchy distribution
    10.2.4 Erlang distribution
    10.2.5 Exponential distribution
    10.2.6 F distribution
    10.2.7 Gamma distribution
    10.2.8 Gumbel and extreme value distributions
    10.2.9 Normal distribution
    10.2.10 Pareto distribution
    10.2.11 Student's t-distribution (Fisher's distribution)
    10.2.12 Uniform distribution
    10.2.13 von Mises distribution
    10.2.14 Weibull distribution
  10.3 Multivariate distributions
  10.4 Kernel Density Estimation
11 Estimation and estimators
  11.1 Maximum Likelihood Estimation (MLE)
  11.2 Bayesian estimation
12 Classical tests
  12.1 Goodness of fit [...]
    [...] Lilliefors
  12.2 Z-tests
    12.2.1 Test of a single mean, standard deviation known
    12.2.2 Test of the difference between two means, standard deviations known
    12.2.3 Tests for proportions, p
  12.3 T-tests
    12.3.1 Test of a single mean, standard deviation not known
    12.3.2 Test of the difference between two means, standard deviation not known
    12.3.3 Test of regression coefficients
  12.4 Variance tests
    12.4.1 Chi-square test of a single variance
    12.4.2 F-tests of two variances
    12.4.3 Tests of homogeneity
  12.5 Wilcoxon rank-sum/Mann-Whitney U test
  12.6 Sign test
13 Contingency tables
  13.1 Chi-square contingency table test
  13.2 G contingency table test
  13.3 Fisher's exact test
  13.4 Measures of association
  13.5 McNemar's test
14 Design of experiments
  14.1 Completely randomized designs
  14.2 Randomized block designs
    14.2.1 Latin squares
    14.2.2 Graeco-Latin squares
  14.3 Factorial designs
    14.3.1 Full Factorial designs
    14.3.2 Fractional Factorial designs
    14.3.3 Plackett-Burman designs
  14.4 Regression designs and response surfaces
  14.5 Mixture designs
15 Analysis of variance and covariance
  15.1 ANOVA
    15.1.1 Single factor or one-way ANOVA
    15.1.2 Two factor or two-way and higher-way ANOVA
  15.2 MANOVA
  15.3 ANCOVA
  15.4 Non-Parametric ANOVA
    15.4.1 Kruskal-Wallis ANOVA
    15.4.2 Friedman ANOVA test
    15.4.3 Mood's Median
16 Regression and smoothing
  16.1 Least squares
  16.2 Ridge regression
  16.3 Simple and multiple linear regression
  16.4 Polynomial regression
  16.5 Generalized Linear Models (GLIM)
  16.6 Logistic regression for proportion data
  16.7 Poisson regression for count data
  16.8 Non-linear regression
  16.9 Smoothing and Generalized Additive Models (GAM)
  16.10 Geographically weighted regression (GWR)
  16.11 Spatial series and spatial autoregression
    16.11.1 SAR models
    16.11.2 CAR models
    16.11.3 Spatial filtering models
17 Time series analysis and temporal autoregression
  17.1 Moving averages
  17.2 Trend Analysis
  17.3 ARMA and ARIMA (Box-Jenkins) models
  17.4 Spectral analysis
18 Resources
  18.1 Distribution tables
  18.2 Bibliography
  18.3 Statistical Software
  18.4 Test Datasets and data archives
  18.5 Websites
  18.6 Tests Index
    18.6.1 Tests and confidence intervals for mean values
    18.6.2 Tests for proportions
    18.6.3 Tests and confidence intervals for the spread of datasets
    18.6.4 Tests of randomness
    18.6.5 Tests of fit to a given distribution
    18.6.6 Tests for cross-tabulated count data
  18.7 R Code samples
    18.7.1 Scatter Plot: Inequality
    18.7.2 Latin Square ANOVA
    18.7.3 Log Odds Ratio Plot
    18.7.4 Normal distribution plot
    18.7.5 Bootstrapping

www.statsref.com (c) 2018

Chapter 1

Introduction

The definition of what is meant by statistics and statistical analysis has changed considerably over the last few decades. Here are two contrasting definitions of what statistics is, from eminent professors in the field, some 60 years apart:

"Statistics is the branch of scientific method which deals with the data obtained by counting or measuring the properties of populations of natural phenomena. In this definition 'natural phenomena' includes all the happenings of the external world, whether human or not." Professor Maurice Kendall, 1943, p2 [MK1]

"Statistics is: the fun of finding patterns in data; the pleasure of making discoveries; the import of deep philosophical questions; the power to shed light on important decisions, and the ability to guide decisions... in business, science, government, medicine, industry." Professor David Hand [DH1]

As these two definitions indicate, the discipline of statistics has moved from being grounded firmly in the world of measurement and scientific analysis into the world of exploration, comprehension and decision-making. At the same time its usage has grown enormously, expanding from a relatively small set of specific application areas (such as design of experiments and computation of life insurance premiums) to almost every walk of life. A particular feature of this change is the massive expansion in information (and misinformation) available to all sectors and age-groups in society. Understanding this information, and making well-informed decisions on the basis of such understanding, is the primary function of modern statistical methods.

Our objective in producing this Handbook is to be comprehensive in terms of concepts and techniques (but not necessarily exhaustive), representative and independent in terms of software tools, and above all practical in terms of application and implementation. However, we believe that it is no longer appropriate to think of a standard, discipline-specific textbook as capable of satisfying every kind of new user need. Accordingly, an innovative feature of our approach here is the range of formats and channels through which we disseminate the material: web, ebook and print. A major advantage of the electronic formats is that the text can be embedded with internal and external hyperlinks (shown underlined). In this Handbook we utilize both forms of link, with external links often referring to a small number of well-established sources, including MacTutor for bibliographic information and a number of other web resources, such as Eric Weisstein's MathWorld and the statistics portal of Wikipedia, that provide additional material on selected topics.

The treatment of topics in this Handbook is relatively informal, in that we do not provide mathematical proofs for much of the material discussed. However, where it is felt particularly useful to clarify how an expression arises, we do provide simple derivations. More generally we adopt the approach of using descriptive explanations and worked examples in order to clarify the usage of different measures and procedures. Frequently, convenient software tools are used for this purpose, notably SPSS/PASW, The R Project, MATLAB and a number of more specialized software tools where appropriate.

Just as all datasets and software packages contain errors, known and unknown, so too do all books and websites, and we expect that there will be errors despite our best efforts to remove these! Some may be genuine errors or misprints, whilst others may reflect our use of specific versions of software packages and their documentation. Inevitably with respect to the latter, new versions of the packages that we have used to illustrate this Handbook will have appeared even before publication, so specific examples, illustrations and comments on scope or restrictions may have been superseded. In all cases the user should review the documentation provided with the software version they plan to use, check release notes for changes and known bugs, and look at any relevant online services (e.g. user/developer forums and blogs on the web) for additional materials and insights.

The interactive web and PDF versions of this Handbook provide color images and active hyperlinks, and may be accessed via the associated Internet site: www.statsref.com. The contents and sample sections of the PDF version may also be accessed from this site. In both cases the information is regularly updated. The Internet is now well established as society’s principal mode of information exchange, and most aspiring users of statistical methods are accustomed to searching for material that can easily be customized to specific needs. Our objective for such users is to provide an independent, reliable and authoritative first port of call for conceptual, technical, software and applications material that addresses the panoply of new user requirements.

Readers wishing to obtain a more in-depth understanding of the background to many of the topics covered in this Handbook should review the Suggested Reading topic.

References

[DH1] Hand D (2009) President of the Royal Statistical Society (RSS), RSS Conference Presentation, November 2009

[MK1] Kendall M G, Stuart A (1943) The Advanced Theory of Statistics: Volume 1, Distribution Theory. Charles Griffin & Company, London. First published in 1943, revised in 1958 with Stuart

1.1 How to use this Handbook

This Handbook is designed to provide a wide-ranging and comprehensive, though not exhaustive, coverage of statistical concepts and methods. Unlike a Wiki the Handbook has a more linear flow structure, and in principle can be read from start to finish. In practice many of the topics, particularly some of those described in later parts of the document, will be of interest only to specific users at particular times, but are provided for completeness. Users are recommended to read the initial four topics (Introduction, Statistical Concepts, Statistical Data and Descriptive Statistics) and then select subsequent sections as required.

Navigating around the PDF or web versions of this Handbook is straightforward, but to assist this process a number of special facilities have been built into the design to make the process even easier. These facilities include:

· Tests Index: this is a form of 'how to' index, i.e. it does not assume that the reader knows the name of the test they may need to use, but can navigate to the correct item by the index description

· Reference links and bibliography: within the text all books and articles referenced are linked to the full reference at the end of the topic section (in the References subsection) in the format [XXXn] and in the complete bibliography at the end of the Handbook

· Hyperlinks: within the document there are two types of hyperlink: (i) internal hyperlinks, which when clicked direct you to the linked topic within this Handbook; (ii) external hyperlinks, which provide access to external resources for which you need an active internet connection. When the external links are clicked the appropriate topic is opened on an external website such as Wikipedia

· Search facilities: the web and PDF versions of this Handbook facilitate free text search, so as long as you know roughly what you are looking for, you should be able to find it using this facility

1.2 Intended audience and scope

Ian Diamond, Statistician and at the time Chief Executive of the UK's Economic and Social Research Council (ESRC), gave the following anecdote (which I paraphrase) during a meeting in 2009 at the Royal Statistical Society in London: "Some time ago I received a brief email from a former student. In it he said 'your statistics course was the one I hated most at University and was more than glad when it was over... but in my working career it has been the most valuable of any of the courses I took...!'"

So, despite its challenges and controversies, taking time to get to grips with statistical concepts and techniques is well worth the effort.

With this perspective in mind, this Handbook has been designed to be accessible to a wide range of readers, from undergraduates and postgraduates studying statistics and statistical analysis as a component of their specific discipline (e.g. social sciences, earth sciences, life sciences, engineering), to practitioners and professional research scientists. However, it is not intended to be a guide for mathematicians, advanced students studying statistics or for professional statisticians. For students studying for academic or professional qualifications in statistics, the level and content adopted is that of the Ordinary and Higher Level Certificates of the Royal Statistical Society (RSS), offered until 2017. Much of the material included in this Handbook is also appropriate for the Graduate Diploma level, although we have not sought to be rigorous or excessively formal in our treatment of individual statistical topics. We prefer to provide less formal explanations and examples that are more approachable by the non-mathematician, with links and references to detailed source materials for those interested in derivation of the expressions provided.

The Handbook is much more than a cookbook of formulas, algorithms and techniques. Its aim is to provide an explanation of the key techniques and formulas of statistical analysis, often using examples from widely available software packages. It stops well short, however, of attempting a systematic evaluation of competing software products. A substantial range of application examples is provided, but any specific selection inevitably illustrates only a small subset of the huge range of facilities available. Wherever possible, examples have been drawn from non-academic and readily reproducible sources, highlighting the widespread understanding and importance of statistics in every part of society, including the commercial and government sectors.

References

Royal Statistical Society: Professional Development section: https://www.rss.org.uk/RSS/pro dev/RSS/pro dev/Professional Development.aspx

1.3 Suggested reading

There are a vast number of books on statistics: Amazon alone lists 10,000 "professional and technical" works with statistics in their title. There is no single book or website on statistics that meets the needs of all levels and requirements of readers, so the answer for many people starting out will be to acquire the main 'set books' recommended by their course tutors and then to supplement these with works that are specific to their application area. Every topic and subtopic in this Handbook almost certainly has at least one entire book devoted to it, so of necessity the material we cover can only provide the essential details and a starting point for deeper understanding of each topic. As far as possible we provide links to articles, web sites, books and software resources to enable the reader to pursue such questions as and when they wish.

Most statistics texts do not make for easy or enjoyable reading! In general they address difficult technical and philosophical issues, and many are demanding in terms of their mathematics. Others are much more approachable; these books include 'classic' undergraduate textbooks such as Feller (1950, [FEL1]), Mood and Graybill (1950, [MOO1]), Hoel (1947, [HOE1]), Adler and Roessler (1960, [ADL1]), Brunk (1960, [BRU1]), Snedecor and Cochran (1937, [SNE1]) and Yule and Kendall (1950, [YUL1]). The dates cited in each case are when the books were originally published; in most cases these works then ran into many subsequent editions and though most are now out-of-print some are still available. A more recent work, available from the American Mathematical Society and also as a free PDF, is Grinstead and Snell's (1997) An Introduction to Probability [GRI1]. Still in print, and of continuing relevance today, is Huff (1954, [HUF1]) "How to Lie with Statistics" which must be the top-selling statistics book of all time. A more recent book, with a similar focus, i

