The importance of statistics and error analysis

Errors and Data Analysis
Types of errors:
1) Precision errors: these are random errors, and could also be called repeatability errors. They are caused by fluctuations in some part (or parts) of the data acquisition. These errors can be treated by statistical analysis.
2) Bias errors: these are systematic errors, such as zero offset, scale errors (nonlinear output vs. input), hysteresis, calibration errors, etc. If these are hidden, they are essentially impossible to correct. They are often negligible in instruments that have been used for calibration for a long time, but new instruments and devices can easily have bias errors. For instance, when reducing scales from meters and millimeters to a scale of nanometers, bias errors can creep in due to unforeseen new effects.
3) Analysis errors: a wrong theory or a wrong analysis is used to “fit” the data. This is usually not considered an error in the data acquisition, but it can nevertheless waste a lot of time.

Examples of a constant signal and random noise from time-acquired data
Where does the “randomness” come from?
- Counting statistics: small numbers (radioactive decay and photon counting)
- Electronic noise from an electronic circuit
- Small-number fluctuations in the number of molecules or nano-sized objects

Some helpful “rules” when dealing with errors of an experimental setup
1: As soon as an error from a particular source is seen to be significantly smaller than other errors present, it is given no further consideration.
2: The major concern of most error analyses is the quantitative estimate of bias errors, and correction of data accordingly when possible.
3: Whenever feasible, precision errors should be estimated from repeated tests or from observed scatter in graphed results.
4: In planning an experiment where it appears that significant bias errors will be present, an effort should be made to ensure that precision errors are much smaller.

How to handle data samples of multiple measurements taken of the same configuration.
The mean value of the sample values is:
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
The usual measure of the scatter is the standard deviation, which is the square root of the variance:
$$S_x = \left[\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2\right]^{1/2}$$
Example: [Figure: histogram of the sample values]
Notice that the shape of the histogram is similar to the familiar normal (Gaussian) probability distribution. Indeed, most precision errors have the characteristic that, as the sample size becomes large, the shape of the histogram tends to that of the normal distribution. This characteristic allows many powerful methods of statistical analysis to be applied to the analysis of precision errors.
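As a minimal sketch of the two definitions on this slide, the mean and the sample standard deviation (with N - 1 in the denominator) can be computed as follows. The readings are made-up illustration values, not data from the slides.

```python
import math

def sample_mean(values):
    """Mean of the sample: sum of the values divided by N."""
    return sum(values) / len(values)

def sample_std(values):
    """Sample standard deviation S_x, using N - 1 in the denominator."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sample_mean(values)
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)

# Hypothetical repeated readings of the same quantity
readings = [9.8, 10.1, 10.0, 9.9, 10.2]
x_bar = sample_mean(readings)
s_x = sample_std(readings)
print(f"mean = {x_bar:.3f}, S_x = {s_x:.3f}")
```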

Running Statistics
Calculation trick using the two definitions for μ and σ:
You can show the following, which is a faster way to keep a running calculation of the variance, and has less digital round-off:
$$\sigma^2 = \frac{1}{N-1}\left[\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2\right]$$
While moving through the signal, a running tally is kept of three parameters: (1) the number of samples already processed, (2) the sum of these samples, and (3) the sum of the squares of the samples (that is, square the value of each sample and add the result to the accumulated value). After any number of samples have been processed, the mean and standard deviation can be efficiently calculated using only the current value of the three parameters.
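The three-tally scheme described above might be sketched like this. The class name `RunningStats` and the sample values are illustrative assumptions. The variance is taken from the sum and the sum of squares, with N - 1 in the denominator.

```python
import math

class RunningStats:
    """Keeps the three running tallies: count, sum, and sum of squares."""

    def __init__(self):
        self.n = 0           # (1) number of samples processed
        self.total = 0.0     # (2) sum of the samples
        self.total_sq = 0.0  # (3) sum of the squares of the samples

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.n

    def std(self):
        # variance = [sum(x^2) - (sum(x))^2 / N] / (N - 1)
        variance = (self.total_sq - self.total ** 2 / self.n) / (self.n - 1)
        return math.sqrt(max(variance, 0.0))  # guard against tiny negative round-off

stats = RunningStats()
for sample in [2, 4, 4, 4, 5, 5, 7, 9]:  # hypothetical signal samples
    stats.add(sample)
print(stats.mean(), stats.std())
```

One caveat worth checking against your own data: when the mean is very large compared with the scatter, the subtraction of two nearly equal sums can itself lose precision. Welford's update is a commonly used alternative in that situation.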

The standard deviation of the mean is:
$$S_{\bar{X}} = \frac{S_X}{N^{1/2}}$$
This is NOT the standard deviation of one measurement from the mean of one set of experiments! If the experiment is repeated many times, giving many data sets with many measurements in each, the mean values of the data sets have a much lower standard deviation than the individual values within a set. That is, there is always less precision error in a sample mean than in the individual measurements, and if the sample size is large enough the error can be negligible.
Remember this is only for the statistical precision error, NOT the bias error.
A statistical analysis of a sample tells a lot about precision errors, but having a sample tells us nothing about bias errors.
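A small simulation can make the distinction concrete. The sketch below assumes normally distributed precision error with a made-up mean and standard deviation. It generates many data sets, then compares the scatter of the set means with the typical within-set standard deviation divided by the square root of N.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

SIGMA = 2.0      # assumed population standard deviation (illustrative)
N = 25           # measurements per data set
NUM_SETS = 2000  # number of repeated data sets

set_means = []
set_stds = []
for _ in range(NUM_SETS):
    data = [random.gauss(10.0, SIGMA) for _ in range(N)]
    set_means.append(statistics.mean(data))
    set_stds.append(statistics.stdev(data))  # S_X of one set (N - 1 form)

scatter_of_means = statistics.stdev(set_means)  # observed scatter of the means
typical_s_x = statistics.mean(set_stds)         # typical scatter inside one set
predicted = typical_s_x / math.sqrt(N)          # S_X / N^(1/2)

print(f"scatter inside a set : {typical_s_x:.3f}")
print(f"scatter of the means : {scatter_of_means:.3f}")
print(f"S_X / sqrt(N)        : {predicted:.3f}")
```

The last two printed numbers should agree closely, and both should be about five times smaller than the first, since N = 25.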

The total error in a measurement is the difference between the measured value and the true value. BUT we do not know what the true value is! If we take a large enough sample we could say that a good estimate of the bias error is $\bar{x} - x_{true}$. But the catch is that we do not know $x_{true}$ a priori: $x_{true}$ is the unknown we want to determine. Thus, determination of bias errors has nothing to do with samples of data and statistical analysis. To find the bias errors you have to compare with data from similar instruments, or with standard measurements, or patiently find the bias in your instrument.

How about least-squares curve fits, that is, cases where one parameter depends on another?
Take the example of a straight-line dependence:
$$y = Mx + C$$
with data points $(x_i, y_i)$; $i = 1, 2, \ldots, N$.
Assume that y has significant precision error, but the x precision error is negligible.
Sum of the squares of the differences:
$$\sum_{i=1}^{N}\left[y_i - (M x_i + C)\right]^2$$

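How to determine the slope and intercept
Standard error for the curve fit is defined as:
$$S_Y = \left[\frac{1}{N-2}\sum D_i^2\right]^{1/2}$$
where $D_i$ is the difference between the data point $y_i$ and the fitted line at $x_i$.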
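The slope and intercept expressions did not survive in this transcription. The sketch below uses the standard closed-form least-squares result, obtained by setting the derivatives of the sum of squared differences with respect to M and C to zero, and then evaluates the standard error of the fit with N - 2 in the denominator. The function name and the data are illustrative assumptions.

```python
import math

def fit_line(xs, ys):
    """Least-squares straight line y = M*x + C, assuming all error is in y.

    Returns (M, C, S_Y), where S_Y is the standard error of the fit.
    """
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three points for N - 2 degrees of freedom")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    # D_i: difference between each data point and the fitted line
    sum_d2 = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    s_y = math.sqrt(sum_d2 / (n - 2))
    return slope, intercept, s_y

# Hypothetical data points
slope, intercept, s_y = fit_line([0, 1, 2, 3], [1, 3, 2, 4])
print(slope, intercept, s_y)
```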

Comments:
- It was assumed that all the variance was in “y”. If “x” also has significant variance, the expressions are more complex.
- If the plot is seen to be nonlinear, maybe we can linearize the data. For instance:
  If $y = a e^{-kx}$, then $\ln y = \ln a - kx$; plot $\ln y$ vs. $x$; the slope is $-k$ and the intercept is $\ln a$.
  If $y = a x^n$, then $\ln y = \ln a + n \ln x$; plot $\ln y$ vs. $\ln x$.
- Often the data points can be fit to several models. If you are testing a theory you know the model; or maybe you are searching for a hint for a theory.
- How do you handle outliers (see figure below and later)?
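The exponential linearization mentioned in the comments can be sketched as follows: take the logarithm of y, fit a straight line against x, and read off k from the slope and a from the intercept. The helper names and the noise-free test data are illustrative assumptions. The approach requires every y value to be positive.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def fit_exponential(xs, ys):
    """Fit y = a*exp(-k*x) through a straight-line fit of ln(y) against x."""
    log_ys = [math.log(y) for y in ys]  # fails if any y <= 0
    slope, intercept = fit_line(xs, log_ys)
    return math.exp(intercept), -slope  # a = exp(intercept), k = -slope

# Hypothetical noise-free data generated with a = 3, k = 0.5
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * math.exp(-0.5 * x) for x in xs]
a_fit, k_fit = fit_exponential(xs, ys)
print(a_fit, k_fit)
```

Taking the logarithm changes how the scatter in y is weighted, so with noisy data the result can differ somewhat from a direct nonlinear fit.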

Another type of “outlier”

Uncertainty
We do not know the actual value of the parameter(s) we are measuring; we only know an estimate of this value. So we have to deal with estimated, or probable, errors. If we say we are C% confident that the true value $X_{true}$ of a measurement $X_i$ lies within the interval $X_i \pm P_X$, then $P_X$ is called the precision uncertainty at a confidence level of C%. This means that if we specify a 95% confidence level estimate of $P_X$, we would expect $X_{true}$ to be in the interval $X_i \pm P_X$ about 95 times out of 100.
We usually assume a normal distribution if $N > 10$; then $P_X$ is approximately twice the standard deviation for 95% confidence:
$$P_X \approx 2 S_X \quad (C = 95\%,\ N > 10)$$
This is the uncertainty at 95% confidence for individual samples drawn from a normal population when the total sample is large. For small samples this must be amended, so always try to keep $N > 10$.

Now what about the precision uncertainty of the mean of repeated sets of measurements, each set consisting of a certain number of individual measurements?
Remember:
$$S_{\bar{X}} = \frac{S_X}{N^{1/2}}$$
Then the corresponding precision uncertainty in the sample mean is:
$$P_{\bar{X}} \approx 2 S_{\bar{X}} \quad (C = 95\%,\ N > 10)$$
So the probable error in a sample mean is much less than in the individual measurements. Why is this important?
- We usually average individual measurements over a time interval before recording the averaged values.
- When precision error is important, we usually are interested in the sample mean, not in individual measurements in any particular set of measurements.

Can we know estimates of our error when we only take a single measurement?
Yes, if we have independent data for the variance of the measurement from previous measurements, or from an examination of the instrument from the factory or from control measurements. But in general it is best to take several measurements.

How about the precision error for a curve fit? Then one can show:
- $\hat{Y}$ for a curve fit is like a “mean” value, analogous to $\bar{X}$ for a sample of values of a single variable.
- $P_Y$ is always larger than $P_{\hat{Y}}$, just like $P_X$ is larger than $P_{\bar{X}}$.
- $P_{\hat{Y}}$ depends on how far $x$ is away from $\bar{x}$: it is a minimum at $x = \bar{x}$.
[Figure: fitted line with two uncertainty bands, labeled as follows]
- The range where the curve fits will fall 95% of the time for repeated sets of measurements
- The range in which we are 95% confident a single data point will fall

Bias uncertainty differs from precision uncertainty:
- We are usually concerned with the precision uncertainty of a sample mean or a curve fit. Precision uncertainties can be reduced by increasing the number of data points used.
- Bias uncertainty is independent of sample size: it is the same for one data point as for a sample of 100 data points.

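The Normal Probability Distribution
The probability density function for a random variable X having a normal distribution:
$$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(X-\mu)^2}{2\sigma^2}\right]$$
[Figure: a single measurement that is assumed to be from a normal parent population.]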

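Confidence levels
The probability that a measurement will fall within a certain number of standard deviations (σ’s) of the mean is found by integrating the normal distribution in the variable $z = (X - \mu)/\sigma$.
E.g. the probability that a measurement will fall within 1 standard deviation:
$$P(-1 \le z \le 1) = \frac{1}{\sqrt{2\pi}}\int_{-1}^{1} e^{-z^2/2}\,dz$$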
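The integral above has no elementary closed form, but it can be evaluated with the error function: the probability of falling within k standard deviations of the mean is erf(k/√2). A minimal sketch using only the Python standard library follows. The function name is an illustrative choice.

```python
import math

def prob_within(k):
    """Probability that a normally distributed measurement lies
    within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1.0, 1.96, 2.0, 3.0):
    print(f"within {k} sigma: {prob_within(k):.4f}")
```

The value for k = 2 is slightly above 95%, which is why twice the standard deviation is commonly used as the 95% uncertainty for large samples.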

t-statistics: small number of samples
Remember:
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad S_x = \left[\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2\right]^{1/2}, \qquad S_{\bar{X}} = \frac{S_X}{N^{1/2}}$$
with $\nu = N - 1$ degrees of freedom.
The precision uncertainty $P_X$ of an individual measurement at a confidence level C% is defined such that we are C% sure that the population mean μ lies in the interval $X_i \pm P_X$. BUT we do not know the population standard deviation σ.
The smaller the number of samples, the larger is “t”.

How well does the sample mean $\bar{X}$ estimate the population mean μ?
Because the sample means are normally distributed, the t-distribution can be used. That is, one can say with C% confidence that the population mean is within $t_{\nu,\%}\, S_{\bar{X}}$ of $\bar{X}$.
NOTE: Sample means are approximately normally distributed even when the parent population is not Gaussian, provided the sample size is large enough.
What do you do with outliers? Find the problem, or, if no reason is found, use Chauvenet’s criterion, which is recommended. It states that points should be discarded if the probability (calculated from the normal distribution) of obtaining their deviation from the mean is less than 1/(2N).
[Table: ratio of the maximum acceptable deviation to the standard deviation, given as a function of N.]
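Chauvenet’s criterion as stated above can be sketched in a few lines. This version assumes the usual two-sided reading, in which the probability of a deviation at least as large as the observed one, in either direction, is compared with 1/(2N). It applies the test once, using the mean and standard deviation of the full sample. The function names and the data are illustrative assumptions.

```python
import math
import statistics
from statistics import NormalDist

def chauvenet_reject(values):
    """Return the points whose deviation from the mean has a two-sided
    normal probability smaller than 1/(2N)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    if s == 0:
        return []  # all values identical: nothing to reject
    rejected = []
    for x in values:
        z = abs(x - mean) / s
        prob = math.erfc(z / math.sqrt(2.0))  # P(|deviation| >= z*s)
        if prob < 1.0 / (2.0 * n):
            rejected.append(x)
    return rejected

def max_deviation_ratio(n):
    """Largest acceptable |deviation| / S for a sample of size n."""
    # A two-sided probability of 1/(2n) leaves 1/(4n) in each tail.
    return NormalDist().inv_cdf(1.0 - 1.0 / (4.0 * n))

# Hypothetical readings with one suspicious value
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.9, 10.1, 10.0, 14.0]
rejected = chauvenet_reject(data)
print(rejected, round(max_deviation_ratio(len(data)), 2))
```

`max_deviation_ratio` reproduces the kind of table the slide refers to: for N = 10 the limit is about 1.96 standard deviations.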

Standard error of a fit to a straight line
$Y_i$ is a random variable and can be taken to have a normal distribution for each value of $x_i$.
For N large and a 95% confidence level, we set $t_{\nu,\%} = 2$.

Standard error of a fit to a straight line
- Standard deviation for the slope
- Precision uncertainty for the slope
- Standard deviation for the intercept
- Precision uncertainty for the intercept
And for N large and a 95% confidence level, we set $t_{\nu,\%} = 2$.
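The expressions for these four quantities were not preserved in the transcription. The sketch below uses the standard textbook results for an ordinary least-squares line with all error in y, which are assumed here rather than taken from the slide: the slope standard deviation is S_Y/√S_xx and the intercept standard deviation is S_Y·√(1/N + x̄²/S_xx), where S_xx is the sum of (x_i − x̄)². Each precision uncertainty is t times the corresponding standard deviation. The function name and the data are illustrative.

```python
import math

def line_fit_uncertainties(xs, ys, t_value=2.0):
    """Slope/intercept of a least-squares line with their standard
    deviations and precision uncertainties (t_value * std)."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residual_sq = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    s_y = math.sqrt(residual_sq / (n - 2))  # standard error of the fit
    s_slope = s_y / math.sqrt(sxx)
    s_intercept = s_y * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
    return {
        "slope": slope, "intercept": intercept, "s_y": s_y,
        "s_slope": s_slope, "s_intercept": s_intercept,
        "p_slope": t_value * s_slope, "p_intercept": t_value * s_intercept,
    }

# Hypothetical data; with only N = 4 points, t = 2 is NOT appropriate,
# so the tabulated 95% value for nu = N - 2 = 2 (about 4.303) is passed in.
fit = line_fit_uncertainties([0, 1, 2, 3], [1, 3, 2, 4], t_value=4.303)
print(fit)
```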

The Correlation Coefficient
$$r = \frac{1}{N-1}\sum_{i=1}^{N}\frac{\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{S_X S_Y}$$
In statistics practice a straight-line curve fit is considered reliable for $0.9 \le |r| \le 1$ (the sign of r indicates whether Y increases or decreases with X).
The correlation coefficient is useful when precision errors are large, such as in experiments in the life sciences and medicine. Then the central question is whether there is any correlation whatsoever. In physics and engineering experiments the precision errors are usually much smaller and the precision uncertainties of $\hat{Y}$, m, and C are more useful.
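A direct transcription of the correlation coefficient into code might look like this, with the sample standard deviations using N - 1 so that they match the 1/(N - 1) factor in front of the sum. The function name and the data are made-up illustrations.

```python
import statistics

def correlation_coefficient(xs, ys):
    """r = 1/(N-1) * sum((X_i - X_bar)(Y_i - Y_bar)) / (S_X * S_Y)."""
    n = len(xs)
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    s_x = statistics.stdev(xs)  # N - 1 form
    s_y = statistics.stdev(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / ((n - 1) * s_x * s_y)

# Hypothetical data
r = correlation_coefficient([0, 1, 2, 3], [1, 3, 2, 4])
print(r)
```

For these four points r comes out at 0.8, which by the rule of thumb above would not count as a reliable straight-line fit.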

But be careful! You can correlate anything, even quantities that are ill-defined or subjectively defined.

Autocorrelation shows how similar data is over certain distances: the correlation between observations separated by k time steps.
Autocovariance
[Figure: a plot showing 100 random numbers with a "hidden" sine function, and an autocorrelation (correlogram) of the series on the bottom.]
c0 is the on
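The autocovariance formula is missing from the transcription. The sketch below assumes one common definition, c_k = (1/N) Σ (x_t − x̄)(x_{t+k} − x̄), with the autocorrelation at lag k given by r_k = c_k / c_0. It then applies this to a series like the one in the figure: 100 random numbers with a weak sine added. The amplitude, period and seed are arbitrary illustration values.

```python
import math
import random

def autocorrelation(series, max_lag):
    """Autocorrelation r_k = c_k / c_0 for lags k = 0 .. max_lag,
    with c_k = (1/N) * sum over t of (x_t - mean)(x_{t+k} - mean)."""
    n = len(series)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the series length")
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev) / n
    result = []
    for k in range(max_lag + 1):
        ck = sum(dev[t] * dev[t + k] for t in range(n - k)) / n
        result.append(ck / c0)
    return result

random.seed(1)
# Random numbers with a "hidden" sine of period 20 samples
signal = [random.uniform(-1.0, 1.0) + 0.5 * math.sin(2 * math.pi * t / 20)
          for t in range(100)]
correlogram = autocorrelation(signal, 40)
print(" ".join(f"{value:.2f}" for value in correlogram[:25]))
```

In a correlogram of such a series, the hidden periodicity tends to show up as an oscillation with the same period as the sine, even when it is hard to see in the raw data.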

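Propagation of Precision Uncertainties
Say Y is a function of N independent measurements $X_i$. If the uncertainties $P_i$ are small enough we can use a first-order Taylor expansion of Y to write
$$\Delta Y \approx \sum_{i=1}^{N}\frac{\partial Y}{\partial X_i}\,\Delta X_i$$
Since Y is a linear function of the independent variables, a theorem of mathematical statistics says:
$$P_Y^2 = \sum_{i=1}^{N}\left(\frac{\partial Y}{\partial X_i}\right)^2 P_i^2$$
or
$$P_Y = \left[\sum_{i=1}^{N}\left(\frac{\partial Y}{\partial X_i}\,P_i\right)^2\right]^{1/2}$$
All the uncertainties in the $X_i$ must be at the same confidence level.
If Y depends only on a product of the independent measurements $X_i$, then
$$\left(\frac{P_Y}{Y}\right)^2 = \sum_{i=1}^{N}\left(\frac{P_i}{X_i}\right)^2$$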
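As a rough sketch of first-order propagation, the function below estimates each partial derivative numerically with a central difference and combines the contributions (∂Y/∂X_i · P_i) in quadrature, which is the standard result for independent measurements. The function name, the step size and the example values are assumptions for illustration.

```python
import math

def propagate_uncertainty(func, values, uncertainties, rel_step=1e-6):
    """First-order propagation: P_Y = sqrt(sum((dY/dX_i * P_i)^2)).

    Partial derivatives are estimated by central differences.
    All P_i must be quoted at the same confidence level.
    """
    total = 0.0
    for i, (x, p) in enumerate(zip(values, uncertainties)):
        h = rel_step * max(abs(x), 1.0)
        up = list(values)
        down = list(values)
        up[i] = x + h
        down[i] = x - h
        dydx = (func(*up) - func(*down)) / (2.0 * h)
        total += (dydx * p) ** 2
    return math.sqrt(total)

# Hypothetical example: area of a rectangle, A = L * W
def area(length, width):
    return length * width

p_area = propagate_uncertainty(area, [2.0, 3.0], [0.1, 0.2])
print(p_area)  # uncertainty in the area
```

For this pure product the relative uncertainties add in quadrature: (0.5/6)² equals (0.1/2)² + (0.2/3)².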

What about:
- Weighting
- Precision and accuracy
- Histograms
- Poisson statistics
- Non-linear fitting
- Chi-square analysis

Weighting
Examples of signals generated from non-stationary processes. In (a), both the mean and standard deviation change. In (b), the standard deviation remains a constant value of one, while the mean changes from a value of zero to two. It is a common analysis technique to break these signals into short segments, and calculate the statistics of each segment individually.
Least-squares fitting of a straight line: minimize
$$S = \sum_i w_i\left(Y_i - y_i\right)^2 = \sum_i \frac{1}{\sigma_i^2}\left(Y_i - y_i\right)^2$$

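If the variance varies, you want to minimize chi-square:
$$\chi^2 = \sum_{i=1}^{n}\frac{\left(y_i - y_{fit}\right)^2}{\sigma_i^2}$$
Goodness-of-fit parameter that should be unity for a “fit within error”:
$$\chi^2_{reduced} = \frac{1}{\nu}\sum_{i=1}^{n}\frac{\left(y_i - y_{fit}\right)^2}{\sigma_i^2}$$
ν is the number of degrees of freedom: n minus the number of parameters fitted.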
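The two chi-square quantities can be evaluated directly once the fitted values and the per-point standard deviations are known. A minimal sketch follows. The function names and the numbers are invented for illustration, and the "fit" values are simply assumed to come from a two-parameter straight-line fit.

```python
def chi_square(y_data, y_fit, sigmas):
    """Sum of squared residuals, each weighted by 1/sigma_i^2."""
    return sum(((y - f) / s) ** 2 for y, f, s in zip(y_data, y_fit, sigmas))

def reduced_chi_square(y_data, y_fit, sigmas, num_params):
    """Chi-square divided by nu = n - (number of fitted parameters)."""
    dof = len(y_data) - num_params
    if dof <= 0:
        raise ValueError("need more data points than fitted parameters")
    return chi_square(y_data, y_fit, sigmas) / dof

# Hypothetical measurements, fitted values, and per-point uncertainties
y_data = [1.0, 2.0, 3.0, 4.0]
y_fit = [1.1, 1.9, 3.2, 4.0]
sigmas = [0.1, 0.1, 0.2, 0.1]

chi2 = chi_square(y_data, y_fit, sigmas)
chi2_reduced = reduced_chi_square(y_data, y_fit, sigmas, num_params=2)
print(chi2, chi2_reduced)
```

With only two degrees of freedom, as here, the reduced value fluctuates a lot from one data set to the next, so it should be read as a rough indicator rather than a sharp test.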

χ² caveats
- Chi-square lower than unity is meaningless if you trust your σ² estimates in the first place.
- Fitting too many parameters will lower χ², but this may be just doing a better and better job of fitting the noise!
- A fit should go smoothly THROUGH the noise, not follow it!
- There is such a thing as enforcing a “parsimonious” fit by minimizing a quantity a bit more complicated than χ². This is done when you have a priori information that the fitted line must be “smooth”.

Graphical description of precision and accuracy
Poor accuracy results from systematic errors. Precision is a measure of random noise. Averaging several measurements will always improve the precision.

Poisson distribution:
λ = mean value, k = number of times observed
Probability of observing k occurrences in time t, where λ is the average rate per time
The variance is equal to the mean.
The classic Poisson example is the data set of von Bortkiewicz (1898), for the chance of a Prussian cavalryman being killed by the kick of a /PoissonDistribution.html
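The Poisson probability of observing k events when the mean number is λ is P(k) = λ^k e^(−λ) / k!. The short sketch below evaluates it and checks numerically that the variance equals the mean. The value λ = 5 is an arbitrary illustration, and the sum is truncated at k = 60, where the remaining tail is negligible.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean number is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 5.0
ks = range(61)  # truncate the infinite sum; the tail beyond 60 is negligible
total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
print(total, mean, variance)
```

For counting experiments this means that the standard deviation of a count is roughly the square root of the expected number of counts.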

Comparison of the Poisson distribution (black dots) and the binomial distribution with n = 10 (red line), n = 20 (blue line), n = 1000 (green line). All distributions have a mean of 5. The horizontal axis shows the number of events k.
http://en.wikipedia.org/wiki/Poisson distribution

Histograms

(a) the histogram, (b) the probability mass function (pmf) and (c) the probability density function (pdf)
The amplitude of these three curves is determined by: (a) the sum of the values in the histogram being equal to the number of samples in the signal; (b) the sum of the values in the pmf being equal to one, and (c) the area under the pdf curve being equal to one.

Examples of probability density functions.
