4. Introduction To Statistics Descriptive Statistics

Upload by : Camille Dion

Statistics for Engineers 4-1

4. Introduction to Statistics

Descriptive Statistics

Types of data

A variate or random variable is a quantity or attribute whose value may vary from one unit of investigation to another. For example, the units might be headache sufferers and the variate might be the time between taking an aspirin and the headache ceasing.

An observation or response is the value taken by a variate for some given unit.

There are various types of variate.

- Qualitative or nominal; described by a word or phrase (e.g. blood group, colour).
- Quantitative; described by a number (e.g. time till cure, number of calls arriving at a telephone exchange in 5 seconds).
- Ordinal; this is an "in-between" case. Observations are not numbers but they can be ordered (e.g. much improved, improved, same, worse, much worse).

Averages etc. can sensibly be evaluated for quantitative data, but not for the other two. Qualitative data can be analysed by considering the frequencies of different categories. Ordinal data can be analysed like qualitative data, but really requires special techniques called nonparametric methods.

Quantitative data can be:

- Discrete: the variate can only take one of a finite or countable number of values (e.g. a count).
- Continuous: the variate is a measurement which can take any value in an interval of the real line (e.g. a weight).

Displaying data

It is nearly always useful to use graphical methods to illustrate your data. We shall describe in this section just a few of the methods available.

Discrete data: frequency table and bar chart

Suppose that you have collected some discrete data. It will be difficult to get a "feel" for the distribution of the data just by looking at it in list form. It may be worthwhile constructing a frequency table or bar chart.

The frequency of a value is the number of observations taking that value.

A frequency table is a list of possible values and their frequencies.

A bar chart consists of bars corresponding to each of the possible values, whose heights are equal to the frequencies.

Example

The numbers of accidents experienced by 80 machinists in a certain industry over a period of one year were found to be as shown below. Construct a frequency table and draw a bar chart.

[Data: 80 observed accident counts, summarised in the frequency table below.]

Solution

Number of accidents   Frequency
0                     55
1                     14
2                     5
3                     2
4                     0
5                     2
6                     1
7                     0
8                     1

[Figure: bar chart "Number of accidents in one year"; x-axis: Number of accidents, y-axis: Frequency.]
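As a rough illustration of how a frequency table and a text bar chart can be built from a list of discrete observations, here is a minimal Python sketch. The list `observations` is hypothetical (it is not the machinists data), and the variable names are chosen only for this example.

```python
from collections import Counter

# Hypothetical observations (accidents per person); NOT the data from the example.
observations = [0, 1, 0, 0, 2, 0, 1, 0, 3, 0, 0, 1]

# Count how often each value occurs.
counts = Counter(observations)

# List every possible value up to the maximum, including values with frequency 0.
frequency_table = {value: counts.get(value, 0)
                   for value in range(max(observations) + 1)}

# Crude text bar chart: one '#' per observation.
for value, freq in frequency_table.items():
    print(f"{value:>2} | {'#' * freq} {freq}")
```

The bar heights are simply the frequencies, so the total number of `#` symbols equals the number of observations.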

Continuous data: histograms

When the variate is continuous, we do not look at the frequency of each value, but group the values into intervals. The plot of frequency against interval is called a histogram. Be careful to define the interval boundaries unambiguously.

Example

The following data are the left ventricular ejection fractions (LVEF) for a group of 99 heart transplant patients. Construct a frequency table and histogram.

[Data: 99 LVEF observations, summarised in the frequency table below.]

Frequency table

LVEF          Frequency
24.5 - 34.5   1
34.5 - 44.5   1
44.5 - 54.5   3
54.5 - 64.5   13
64.5 - 74.5   45
74.5 - 84.5   36

[Figure: "Histogram of LVEF"; x-axis: LVEF, y-axis: Frequency.]

Note: if the interval lengths are unequal, the heights of the rectangles are chosen so that the area of each rectangle equals the frequency, i.e. height of rectangle = frequency / interval length.
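The rule for unequal interval lengths can be sketched in a few lines of Python. The regrouping below, which merges the first three LVEF classes into one interval of length 30 with frequency 1 + 1 + 3 = 5, is an assumption made purely to produce unequal intervals; it does not appear in the notes.

```python
# Each class is (lower boundary, upper boundary, frequency).
# The first class is a hypothetical merge of 24.5-34.5, 34.5-44.5 and 44.5-54.5.
classes = [
    (24.5, 54.5, 5),
    (54.5, 64.5, 13),
    (64.5, 74.5, 45),
    (74.5, 84.5, 36),
]

def rectangle_heights(classes):
    """Height of each rectangle = frequency / interval length."""
    return [freq / (upper - lower) for lower, upper, freq in classes]

heights = rectangle_heights(classes)

# The area of each rectangle (height * interval length) recovers the frequency.
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, classes)]

for (lower, upper, freq), h in zip(classes, heights):
    print(f"{lower} - {upper}: frequency {freq}, height {h:.3f}")
```

Plotting the raw frequency 5 over the wide interval would overstate how dense the data are there; dividing by the interval length avoids this.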

Things to look out for

Bar charts and histograms provide an easily understood illustration of the distribution of the data. As well as showing where most observations lie and how variable the data are, they also indicate certain "danger signals" about the data.

Normally distributed data

The histogram is bell-shaped, like the probability density function of a Normal distribution. It appears, therefore, that the data can be modelled by a Normal distribution. (Other methods for checking this assumption are available.) Similarly, the histogram can be used to see whether data look as if they are from an Exponential or Uniform distribution.

[Figure: bell-shaped histogram; x-axis: BSFC, y-axis: Frequency.]

Very skew data

The relatively few large observations can have an undue influence when comparing two or more sets of data. It might be worthwhile using a transformation, e.g. taking logarithms.

[Figure: skewed histogram; x-axis: Time till failure (hrs), y-axis: Frequency.]

Bimodality

This may indicate the presence of two subpopulations with different characteristics. If the subpopulations can be identified it might be better to analyse them separately.

[Figure: bimodal histogram; x-axis: Time till failure (hrs), y-axis: Frequency.]

Outliers

The data appear to follow a pattern with the exception of one or two values. You need to decide whether the strange values are simply mistakes, are to be expected, or whether they are correct but unexpected. The outliers may have the most interesting story to tell.

[Figure: histogram with outliers; x-axis: Time till failure (hrs), y-axis: Frequency.]

Summary Statistics

Measures of location

By a measure of location we mean a value which typifies the numerical level of a set of observations. (It is sometimes called a "central value", though this can be a misleading name.) We shall look at three measures of location and then discuss their relative merits.

Sample mean

The sample mean of the values $x_1, x_2, \ldots, x_n$ is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

This is just the average or arithmetic mean of the values. Sometimes the prefix "sample" is dropped, but then there is a possibility of confusion with the population mean which is defined later.

Frequency data: suppose that the frequency of the class with midpoint $x_i$ is $f_i$, for $i = 1, 2, \ldots, m$. Then

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{m} f_i x_i$$

where $n = \sum_{i=1}^{m} f_i$ is the total number of observations.

Example

Accidents data: find the sample mean.

Number of accidents, x_i   Frequency, f_i   f_i x_i
0                          55               0
1                          14               14
2                          5                10
3                          2                6
4                          0                0
5                          2                10
6                          1                6
7                          0                0
8                          1                8
TOTAL                      80               54

$$\bar{x} = \frac{54}{80} = 0.675$$
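The frequency-data formula for the sample mean can be checked directly against the accidents table. This is a small Python sketch; the variable names are chosen for illustration only.

```python
# Accidents data from the frequency table: values x_i and frequencies f_i.
accidents = [0, 1, 2, 3, 4, 5, 6, 7, 8]
frequencies = [55, 14, 5, 2, 0, 2, 1, 0, 1]

# n = sum of f_i (total number of observations).
n = sum(frequencies)

# Sum of f_i * x_i, the last column of the table.
weighted_sum = sum(f * x for f, x in zip(frequencies, accidents))

# Sample mean = (sum of f_i x_i) / n.
sample_mean = weighted_sum / n

print(n, weighted_sum, sample_mean)
```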

Sample median

The median is the central value in the sense that there are as many values smaller than it as there are larger than it.

All values known: if there are n observations then the median is:

- the $\frac{n+1}{2}$th largest value, if n is odd;
- the sample mean of the $\frac{n}{2}$th largest and the $(\frac{n}{2}+1)$th largest values, if n is even.

Mode

The mode, or modal value, is the most frequently occurring value. For continuous data, the simplest definition of the mode is the midpoint of the interval with the highest rectangle in the histogram. (There is a more complicated definition involving the frequencies of neighbouring intervals.) It is only useful if there are a large number of observations.

Comparing mean, median and mode

Symmetric data: the mean, median and mode will be approximately equal.

[Figure: "Histogram of reaction times"; x-axis: Reaction time (sec), y-axis: Frequency.]

Skew data: the median is less sensitive than the mean to extreme observations. The mode ignores them.

[Figure: skewed distribution with the mode marked. Source: IFS Briefing Note No 73.]
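To make the definitions of median and mode concrete, here is a short Python sketch applied to the accidents data (expanded from its frequency table). The helper names `sample_median` and `sample_mode` are hypothetical, and the mode helper simply returns the first of any tied values.

```python
from collections import Counter

def sample_median(values):
    """Middle value if n is odd, mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def sample_mode(values):
    """Most frequently occurring value (first one found if there are ties)."""
    return Counter(values).most_common(1)[0][0]

# Accidents data expanded from the frequency table (value: frequency).
frequencies = {0: 55, 1: 14, 2: 5, 3: 2, 4: 0, 5: 2, 6: 1, 7: 0, 8: 1}
accidents = [x for x, f in frequencies.items() for _ in range(f)]

mean = sum(accidents) / len(accidents)
median = sample_median(accidents)
mode = sample_mode(accidents)

print(mean, median, mode)
```

For these skew data the mean is pulled above the median and mode by the few machinists with many accidents.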

The mode is dependent on the choice of class intervals and is therefore not favoured for sophisticated work.

Sample mean and median: it is sometimes said that the mean is better for symmetric, well behaved data while the median is better for skewed data, or data containing outliers. The choice mainly depends on the use to which you intend putting the "central" value. If the data are very skew, bimodal or contain many outliers, it may be questionable whether any single figure can be used; it is much better to plot the full distribution. For more advanced work, the median is more difficult to work with. If the data are skewed, it may be better to make a transformation (e.g. take logarithms) so that the transformed data are approximately symmetric and then use the sample mean.

Statistical Inference

Probability theory: the probability distribution of the population is known; we want to derive results about the probability of one or more values ("random sample") - deduction.

Statistics: the results of the random sample are known; we want to determine something about the probability distribution of the population - inference.

[Figure: Population and Sample.]

In order to carry out valid inference, the sample must be representative, and preferably a random sample.

Random sample: two elements: (i) no bias in the selection of the sample; (ii) different members of the sample chosen independently.

Formal definition of a random sample: $X_1, X_2, \ldots, X_n$ are a random sample if each $X_i$ has the same distribution and the $X_i$'s are all independent.

Parameter estimation

We assume that we know the type of distribution, but we do not know the value of the parameters $\theta$, say. We want to estimate $\theta$ on the basis of a random sample $X_1, X_2, \ldots, X_n$.

Let's call the random sample $X_1, X_2, \ldots, X_n$ our data D. We wish to infer $P(\theta|D)$, which by Bayes' theorem is

$$P(\theta|D) = \frac{P(D|\theta)\,P(\theta)}{P(D)}$$

$P(\theta)$ is called the prior, which is the probability distribution from any prior information we had before looking at the data (often this is taken to be a constant). The denominator $P(D)$ does not depend on the parameters, and so is just a normalization constant. $P(D|\theta)$ is called the likelihood: it is how likely the data is given a particular set of parameters.

The full distribution $P(\theta|D)$ gives all the information about the probability of different parameter values given the data. However it is often useful to summarise this information, for example giving a peak value and some error bars.

Maximum likelihood estimator: the value of $\theta$ that maximizes the likelihood $P(D|\theta)$ is called the maximum likelihood estimate: it is the value that makes the data most likely, and if $P(\theta)$ does not depend on parameters (e.g. is a constant) it is also the most probable value of the parameter given the observed data.

The maximum likelihood estimator is usually the best estimator, though in some instances it may be numerically difficult to calculate. Other simpler estimators are sometimes possible. Estimates are typically denoted by $\hat\theta$, etc. Note that since $P(D|\theta)$ is positive, maximizing $P(D|\theta)$ gives the same as maximizing $\log P(D|\theta)$.

Example

Random samples $X_1, X_2, \ldots, X_n$ are drawn from a Normal distribution $N(\mu, \sigma^2)$. What is the maximum likelihood estimate of the mean $\mu$?

Solution

$$P(D|\mu,\sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X_i-\mu)^2}{2\sigma^2}\right)$$

We find the maximum likelihood by maximizing the log likelihood, here

$$\log P(D|\mu,\sigma) = -\sum_{i=1}^{n} \frac{(X_i-\mu)^2}{2\sigma^2} - \frac{n}{2}\log(2\pi\sigma^2)$$

So for a maximum likelihood estimate of $\mu$ we want

$$\frac{\partial}{\partial\mu}\log P(D|\mu,\sigma) = \sum_{i=1}^{n} \frac{X_i-\mu}{\sigma^2} = 0$$

The solution is the maximum likelihood estimator $\hat\mu$ with

$$\sum_{i=1}^{n}(X_i - \hat\mu) = 0 \quad\Rightarrow\quad \hat\mu = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}$$

So the maximum likelihood estimator of the mean is just the sample mean we discussed before. We can similarly maximize with respect to $\sigma^2$ when the mean is the maximum likelihood value $\hat\mu$. This gives

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$$
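The result of this example can be checked numerically. The sketch below evaluates the Normal log likelihood for a small data set (the five values 6, 4, 9, 5, 2, used here only as illustrative data) and confirms that moving either parameter away from the closed-form estimates lowers the log likelihood. The function name `normal_log_likelihood` is an assumption for this example.

```python
import math

def normal_log_likelihood(data, mu, sigma2):
    """log P(D | mu, sigma^2) for independent Normal samples."""
    n = len(data)
    squared_deviations = sum((x - mu) ** 2 for x in data)
    return -squared_deviations / (2 * sigma2) - 0.5 * n * math.log(2 * math.pi * sigma2)

data = [6, 4, 9, 5, 2]  # illustrative data

# Closed-form maximum likelihood estimates.
mu_hat = sum(data) / len(data)
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)

best = normal_log_likelihood(data, mu_hat, sigma2_hat)

# Nearby parameter values should all give a lower log likelihood.
for mu, sigma2 in [(mu_hat + 0.1, sigma2_hat), (mu_hat - 0.1, sigma2_hat),
                   (mu_hat, sigma2_hat * 1.1), (mu_hat, sigma2_hat * 0.9)]:
    print(mu, sigma2, normal_log_likelihood(data, mu, sigma2) < best)
```

This is only a spot check at a few nearby points, not a proof; the derivation above is what establishes the maximum.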

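Comparing estimators

A good estimator should have as narrow a distribution as possible (i.e. be as close to the correct value as possible). Often it is also useful for it to be unbiased, meaning that on average (over possible data samples) it gives the true value:

The estimator $\hat\theta$ is unbiased for $\theta$ if $E(\hat\theta) = \theta$ for all values of $\theta$. Below, the average over possible data samples is also written with angle brackets, $\langle\cdot\rangle$.

[Figure: distributions of a good but biased estimator and a poor but unbiased estimator, with the true mean marked.]

Result: $\hat\mu = \bar{X}$ is an unbiased estimator of $\mu$.

$$\langle\bar{X}\rangle = \frac{1}{n}\langle X_1 + \cdots + X_n\rangle = \frac{1}{n}\left(\langle X_1\rangle + \cdots + \langle X_n\rangle\right) = \frac{1}{n}\,n\mu = \mu$$

Result: $\hat\sigma^2$ is a biased estimator of $\sigma^2$.

$$\langle\hat\sigma^2\rangle = \frac{1}{n}\Big\langle\sum_{i}(X_i-\bar{X})^2\Big\rangle = \frac{1}{n}\sum_{i}\langle X_i^2\rangle - \langle\bar{X}^2\rangle = (\sigma^2+\mu^2) - \frac{1}{n^2}\sum_{i,j}\langle X_i X_j\rangle$$

$$= (\sigma^2+\mu^2) - \frac{1}{n^2}\left[n(\sigma^2+\mu^2) + n(n-1)\mu^2\right] = \frac{n-1}{n}\,\sigma^2$$

where we used $\langle X_i^2\rangle = \sigma^2 + \mu^2$ and $\langle X_i X_j\rangle = \langle X_i\rangle\langle X_j\rangle = \mu^2$ for independent variables ($i \ne j$).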
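The two results above can be verified exactly for a simple case by averaging over every possible sample rather than simulating. The sketch below assumes a fair six-sided die as the population and samples of size n = 3; both choices are hypothetical and made only to keep the enumeration small (6^3 = 216 equally likely samples). Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

values = [1, 2, 3, 4, 5, 6]  # hypothetical population: a fair die
n = 3                        # sample size

def mean(xs):
    return Fraction(sum(xs), len(xs))

def sigma2_hat(xs):
    """Variance estimator that divides by n (the maximum likelihood form)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# True population mean and variance.
mu = mean(values)
sigma2 = sum((v - mu) ** 2 for v in values) / len(values)

# Every possible sample of size n is equally likely.
samples = list(product(values, repeat=n))

expected_mean = sum(mean(s) for s in samples) / len(samples)
expected_sigma2_hat = sum(sigma2_hat(s) for s in samples) / len(samples)

# Rescaling by n/(n-1) removes the bias.
expected_s2 = expected_sigma2_hat * n / (n - 1)

print(mu, expected_mean)                # both 7/2
print(sigma2, expected_sigma2_hat)      # 35/12 and 35/18
print(expected_s2)                      # 35/12
```

The average of $\hat\sigma^2$ comes out as exactly $(n-1)/n$ times the true variance, in agreement with the derivation.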

Sample variance

Since $\hat\sigma^2$ is a biased estimator of $\sigma^2$ it is common to use the unbiased estimator of the variance, often called the sample variance:

$$s^2 = \frac{n}{n-1}\,\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$$

The last form is often more convenient to calculate, but also less numerically stable (you are taking the difference of two potentially large numbers).

Why the n-1?

We showed that $\langle\hat\sigma^2\rangle = \frac{n-1}{n}\sigma^2$, and hence that $s^2 = \frac{n}{n-1}\hat\sigma^2$, with $\langle s^2\rangle = \sigma^2$, is an unbiased estimate of the variance.

Intuition: the reason the estimator $\hat\sigma^2$ is biased is that the mean is also estimated from the same data. It is not biased if you know the true mean and can use $\mu$ instead of $\bar{X}$. One unit of information has to be used to estimate the mean, leaving n-1 units to estimate the variance. This is very obvious with only one data point $X_1$: if you know the true mean this still tells you something about the variance, but if you have to estimate the mean as well (best guess $X_1$) you have nothing left to learn about the variance. This is why the unbiased estimator is undefined for n = 1.

Intuition 2: the sample mean is closer to the centre of the distribution of the samples than the true (population) mean is, so estimating the variance using the r.m.s. distance from the sample mean underestimates the variance (which is the scatter about the population mean).

For a normal distribution the estimator $\hat\sigma^2$ is the maximum likelihood value when $\mu = \hat\mu$, i.e. with the mean fixed to its maximum likelihood value. If we averaged over possible values of the true mean (a process called marginalization), and then maximized this averaged distribution, we would have found that $s^2$ is the maximum likelihood estimator, i.e. $s^2$ accounts for uncertainty in the true mean. For large n the mean is measured accurately, and $s^2 \approx \hat\sigma^2$.

Measures of dispersion

A measure of dispersion is a value which indicates the degree of variability of data. Knowledge of the variability may be of interest in itself but more often is required in order to decide how precisely the sample mean (an estimator of the mean) reflects the population (true) mean.

A measure of dispersion in the same units as the data is the standard deviation, which is just the (positive) square root of the sample variance:

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2}$$
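The remark about numerical stability of the two forms of the sample variance can be illustrated with a short Python sketch. The function names and the data are hypothetical; the second data set is deliberately shifted by 10^9 so that the "sum of squares minus n times mean squared" form subtracts two numbers of order 10^18, which double-precision floats cannot resolve to better than a few hundred.

```python
def variance_two_pass(values):
    """s^2 = sum (x - mean)^2 / (n - 1): subtract the mean first, then square."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def variance_shortcut(values):
    """s^2 = (sum x^2 - n * mean^2) / (n - 1): convenient but less stable."""
    n = len(values)
    mean = sum(values) / n
    return (sum(x * x for x in values) - n * mean * mean) / (n - 1)

# Well-scaled data: both forms agree to rounding error.
small = [6.0, 4.0, 9.0, 5.0, 2.0]
print(variance_two_pass(small), variance_shortcut(small))

# Same spread of values, but with a large offset: the true sample variance is 30.
shifted = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print(variance_two_pass(shifted), variance_shortcut(shifted))
```

With the shifted data the two-pass form still returns 30, while the shortcut form does not, because the small difference is lost in the rounding of the large sums.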

For frequency data, where $f_i$ is the frequency of the class with midpoint $x_i$ ($i = 1, 2, \ldots, m$):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{m} f_i (x_i-\bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{m} f_i x_i^2 - n\bar{x}^2\right)$$

Example

Find the sample mean and standard deviation of the following: 6, 4, 9, 5, 2.

Example

Evaluate the sample mean and standard deviation, using the frequency table.

LVEF          Midpoint, x_i   Frequency, f_i
24.5 - 34.5   29.5            1
34.5 - 44.5   39.5            1
44.5 - 54.5   49.5            3
54.5 - 64.5   59.5            13
64.5 - 74.5   69.5            45
74.5 - 84.5   79.5            36

Using the class midpoints, $n = 99$, $\sum f_i x_i = 6980.5$ and $\sum f_i x_i^2 = 500694.75$, so:

Sample mean, $\bar{x} = 6980.5/99 \approx 70.51$
Sample variance, $s^2 \approx 86.72$
Sample standard deviation, $s \approx 9.31$

Note: when using a calculator, work to full accuracy during calculations in order to minimise rounding errors. If your calculator has statistical functions, s is denoted by $\sigma_{n-1}$.

Percentiles and the interquartile range

The kth percentile is the value corresponding to cumulative relative frequency of k/100 on the cumulative relative frequency diagram, e.g. the 2nd percentile is the value corresponding to cumulative relative frequency 0.02. The 25th percentile is also known as the first quartile and the 75th percentile is also known as the third quartile. The interquartile range of a set of data is the difference between the third quartile and the first quartile, or the interval between these values. It is the range within which the "middle half" of the data lie, and so is a measure of spread which is not too sensitive to one or two outliers.
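The two examples on the sample mean and standard deviation can be worked through with the frequency-data formula; raw data are just the special case where every frequency is 1. The sketch below is one way to do this in Python. The helper name `grouped_mean_and_variance` is hypothetical, and the LVEF calculation assumes each observation sits at its class midpoint, so the results are approximations to the values for the ungrouped data.

```python
import math

def grouped_mean_and_variance(midpoints, frequencies):
    """Return (n, sample mean, sample variance) for frequency data."""
    n = sum(frequencies)
    mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
    variance = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, midpoints)) / (n - 1)
    return n, mean, variance

# First example: raw data, each value has frequency 1.
raw = [6, 4, 9, 5, 2]
n_raw, mean_raw, var_raw = grouped_mean_and_variance(raw, [1] * len(raw))
sd_raw = math.sqrt(var_raw)

# Second example: LVEF frequency table, using class midpoints.
lvef_midpoints = [29.5, 39.5, 49.5, 59.5, 69.5, 79.5]
lvef_frequencies = [1, 1, 3, 13, 45, 36]
n_lvef, mean_lvef, var_lvef = grouped_mean_and_variance(lvef_midpoints, lvef_frequencies)
sd_lvef = math.sqrt(var_lvef)

print(f"raw data: mean {mean_raw:.2f}, variance {var_raw:.2f}, sd {sd_raw:.2f}")
print(f"LVEF: n {n_lvef}, mean {mean_lvef:.2f}, variance {var_lvef:.2f}, sd {sd_lvef:.2f}")
```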

[Figure: cumulative relative frequency diagram marking the 0.02 percentile, the 1st, 2nd and 3rd quartiles, and the interquartile range.]

Range

The range of a set of data is the difference between the maximum and minimum values, or the interval between these values. It is another measure of the spread of the data.

Comparing sample standard deviation, interquartile range and range

The range is simple to evaluate and understand, but is sensitive to the odd extreme value and does not make effective use of all the information in the data. The sample standard deviation is also rather sensitive to extreme values but is easier to work with mathematically than the interquartile range.

Confidence Intervals

Estimates are "best guesses" in some sense, and the sample variance gives some idea of the spread. Confidence intervals are another measure of spread, a range within which we are "pretty sure" that the parameter lies.

Normal data, variance known

Random sample $X_1, X_2, \ldots, X_n$ from $N(\mu, \sigma^2)$, where $\sigma^2$ is known but $\mu$ is unknown. We want a confidence interval for $\mu$.

Recall:
(i) $\bar{X} \sim N(\mu, \sigma^2/n)$
(ii) With probability 0.95, a Normal random variable lies within 1.96 standard deviations of the mean.

[Figure: Normal density with P = 0.025 in each tail.]

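Since the variance of the sample mean is $\sigma^2/n$, this gives

$$P\left(\mu - 1.96\frac{\sigma}{\sqrt{n}} < \bar{X} < \mu + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

To infer the distribution of $\mu$ given $\bar{X}$ we need to use Bayes' theorem

$$P(\mu|\bar{X}) = \frac{P(\bar{X}|\mu)\,P(\mu)}{P(\bar{X})}$$

If the prior on $\mu$ is constant, then $P(\mu|\bar{X})$ is also Normal with mean $\bar{X}$, so

$$P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

A 95% confidence interval for $\mu$ is: $\bar{X} - 1.96\,\sigma/\sqrt{n}$ to $\bar{X} + 1.96\,\sigma/\sqrt{n}$.

Two tail versus one tail

When the distribution has two ends (tails) where the likelihood goes to zero, the most natural choice of confidence interval is the region excluding both tails, so a 95% confidence region means that 2.5% of the probability is in the high tail, 2.5% in the low tail. If the distribution is one sided, a one tail interval is more appropriate.

Example: 95% confidence regions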

[Figure: two-tail region (Normal example) with P = 0.025 in each tail; one-tail region with P = 0.05 in the single tail.]

Example: Polling (Binomial data)

[unnecessarily complicated example but a useful general result for poll error bars]

A sample of 1000 random voters were polled, with 350 saying they will vote for the Conservatives and 650 saying another party. What is the 95% confidence interval for the Conservative share of the vote?

Solution:

n Bernoulli trials, X = number of people saying they will vote Conservative; $X \sim B(n, p)$. If n is large, X is approximately $N(np, np(1-p))$.

The mean is $np$, so we can estimate $p$ by $\bar{p} = X/n$. The variance of $X/n$ is $p(1-p)/n$, and hence the variance of $\bar{p}$ can be taken to be $\bar{p}(1-\bar{p})/n$ (or a standard deviation of $\sqrt{\bar{p}(1-\bar{p})/n}$). Hence the 95% confidence (two-tail) interval is

$$\bar{p} - 1.96\sqrt{\frac{\bar{p}(1-\bar{p})}{n}} < p < \bar{p} + 1.96\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}$$

With $n = 1000$ and $\bar{p} = 0.35$ this corresponds to a standard deviation of 0.015 and a 95% confidence plus/minus error of 3%:

0.35 - 0.03 < p < 0.35 + 0.03, so 0.32 < p < 0.38.
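The polling calculation is easy to reproduce. The sketch below uses Python's standard-library `statistics.NormalDist` to obtain the two-tail Normal multiplier (about 1.96 for 95%) instead of reading it from tables; the function name `proportion_confidence_interval` is hypothetical, and the Normal approximation to the Binomial is assumed to hold, as in the example.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Two-tail interval for a Binomial proportion via the Normal approximation."""
    p_bar = successes / n
    std_dev = math.sqrt(p_bar * (1 - p_bar) / n)
    # For 95% confidence this is the 0.975 point of N(0, 1), about 1.96.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return p_bar, std_dev, z, (p_bar - z * std_dev, p_bar + z * std_dev)

p_bar, std_dev, z, (lower, upper) = proportion_confidence_interval(350, 1000)

print(f"estimate {p_bar:.2f}, standard deviation {std_dev:.3f}, z {z:.2f}")
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
```

Because the standard deviation scales as $1/\sqrt{n}$, halving the error bar of a poll requires roughly four times as many respondents.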

Normal data, variance unknown: Student's t-distribution

Random sample $X_1, X_2, \ldots, X_n$ from $N(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are unknown. We want a confidence interval for $\mu$.

The distribution of $\dfrac{\bar{X}-\mu}{s/\sqrt{n}}$ is called a t-distribution with $n-1$ degrees of freedom.

So the situation is like when we know the variance, when $\dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}}$ is normally distributed, but now replacing $\sigma^2$ by the sample estimate $s^2$. We have to use the t-distribution instead.

The fact that you have to estimate the variance from the data (making true variances larger than the estimated sample variance possible) broadens the tails significantly when there are not a large number of data points. As n becomes large, the t-distribution converges to a normal.

Derivation of the t-distribution is a bit tricky, so we'll just look at how to use it.

If $\sigma^2$ is known, the confidence interval for $\mu$ is $\bar{X} - z\,\sigma/\sqrt{n}$ to $\bar{X} + z\,\sigma/\sqrt{n}$, where $z$ is obtained from Normal tables.

If $\sigma^2$ is unknown, we need to make two changes:
(i) estimate $\sigma^2$ by $s^2$, the sample variance;
(ii) replace $z$ by $t_{n-1}$, the value obtained from t-tables.

The confidence interval for $\mu$ is: $\bar{X} - t_{n-1}\,s/\sqrt{n}$ to $\bar{X} + t_{n-1}\,s/\sqrt{n}$.

t-tables: these give $t_\nu$ for different values Q of the cumulative Student's t-distributions, and for different values of $\nu$. The parameter $\nu$ is called the number of degrees of freedom. When the mean and variance are unknown, there are n-1 degrees of freedom to estimate the variance, and this is the relevant quantity here.

Q is the cumulative probability up to $t_\nu$, i.e. $Q = P(t \le t_\nu)$. The t-tables are laid out differently from the N(0,1) tables.

(Wikipedia: Beer is good for statistics!)

For a 95% confidence interval, we want the middle 95% region, so Q = 0.975 (i.e. 0.05/2 = 0.025 in both tails). Similarly, for a 99% confidence interval, we would want Q = 0.995.

[Figure: t-distribution density with the central 0.95 region up to $t_\nu$ and 0.025 in the upper tail.]

Example

Suppose n = 20 pieces of data drawn from a Normal distribution have sample mean $\bar{x} = 10$ and sample variance $s^2 = 2$. What is the 95% confidence interval for the population mean $\mu$?

From t-tables, $\nu = 19$, Q = 0.975, t = 2.093.

The 95% confidence interval for $\mu$ is: $10 \pm 2.093\sqrt{2/20}$, i.e. 9.34 to 10.66.
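The t-interval in this example can be reproduced with a few lines of Python. The sketch assumes a sample mean of 10 and a sample variance of 2, values chosen to be consistent with the quoted interval of 9.34 to 10.66, and takes the multiplier 2.093 (19 degrees of freedom, Q = 0.975) from the t-tables as a constant, since the Python standard library has no t-distribution quantile function. The function name `t_confidence_interval` is hypothetical.

```python
import math

def t_confidence_interval(sample_mean, sample_variance, n, t_value):
    """Interval sample_mean +/- t * s / sqrt(n), with t looked up in t-tables."""
    half_width = t_value * math.sqrt(sample_variance / n)
    return sample_mean - half_width, sample_mean + half_width

# Assumed inputs: n = 20, sample mean 10, sample variance 2,
# and t = 2.093 from t-tables for 19 degrees of freedom at Q = 0.975.
lower, upper = t_confidence_interval(10.0, 2.0, 20, 2.093)

print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
```

Using the Normal value 1.96 in place of 2.093 would give a slightly narrower interval, which understates the uncertainty for a sample this small.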

Sample size

When planning an experiment or series of tests, you need to decide how many repeats to carry out to obtain a certain level of precision in your estimate. The confidence interval formula can be helpful.

For example, for Normal data, the confidence interval for $\mu$ is $\bar{X} \pm t_{n-1}\,s/\sqrt{n}$.

Suppose we want to estimate $\mu$ to within $\pm\delta$, where $\delta$ (and the degree of confidence) is given. We must choose the sample size, n, satisfying:

$$t_{n-1}\frac{s}{\sqrt{n}} \le \delta \quad\Rightarrow\quad n \ge \frac{t_{n-1}^2\, s^2}{\delta^2}$$

To use this we need:
(i) an estimate of $s^2$ (e.g. results from previous experiments);
(ii) an estimate of $t_{n-1}$. This depends on n, but not very strongly. You will not go far wrong, in general, if you take $t_{n-1}$ [value missing in source] for 95% confidence.

Rule of thumb: for 95% confidence, choose n [formula missing in source].
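The sample-size condition, that the half-width $t_{n-1}\,s/\sqrt{n}$ should be no larger than the required precision $\delta$, can be turned into a small calculation. In the sketch below the inputs are all assumptions for illustration: a variance estimate of 2, a target precision of 0.5, and the t-table value 2.093 used as a stand-in for $t_{n-1}$ (strictly it corresponds to 19 degrees of freedom, but the multiplier changes only slowly with n). The function name `required_sample_size` is hypothetical.

```python
import math

def required_sample_size(sample_variance, delta, t_value):
    """Smallest n with t * s / sqrt(n) <= delta, i.e. n >= t^2 s^2 / delta^2."""
    return math.ceil(t_value ** 2 * sample_variance / delta ** 2)

# Assumed inputs: s^2 = 2 from earlier work, precision delta = 0.5,
# and t = 2.093 as a rough 95% multiplier.
s2, delta, t_value = 2.0, 0.5, 2.093
n_required = required_sample_size(s2, delta, t_value)

half_width = t_value * math.sqrt(s2 / n_required)
print(n_required, round(half_width, 3))
```

Because n enters through $\sqrt{n}$, halving $\delta$ quadruples the required number of repeats.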
