Statistical Data Analysis, Stat 5: More on Nuisance Parameters


Statistical Data Analysis
Stat 5: More on nuisance parameters, Bayesian methods

London Postgraduate Lectures on Particle Physics; University of London MSci course PH4515

Glen Cowan
Physics Department, Royal Holloway, University of London
g.cowan@rhul.ac.uk
www.pp.rhul.ac.uk/~cowan

Course web page: www.pp.rhul.ac.uk/~cowan/stat_course.html

Systematic uncertainties and nuisance parameters

In general our model of the data is not perfect:

[Figure: model L(x|θ) compared with the true distribution of x.]

We can improve the model by including additional adjustable parameters. Nuisance parameter ↔ systematic uncertainty: some point in the parameter space of the enlarged model should be "true".

The presence of a nuisance parameter decreases the sensitivity of the analysis to the parameter of interest (e.g., it increases the variance of the estimate).

p-values in cases with nuisance parameters

Suppose we have a statistic qθ that we use to test a hypothesized value of a parameter θ, such that the p-value of θ is

    pθ = ∫ f(qθ|θ, ν) dqθ ,  integrated from qθ,obs to ∞.

But what values of ν should we use for f(qθ|θ, ν)? Fundamentally we want to reject θ only if pθ < α for all ν → "exact" confidence interval.

Recall that for statistics based on the profile likelihood ratio, the distribution f(qθ|θ, ν) becomes independent of the nuisance parameters in the large-sample limit. But in general for finite data samples this is not true; one may be unable to reject some θ values if all values of ν must be considered, even those strongly disfavoured by the data (the resulting interval for θ "overcovers").

Profile construction ("hybrid resampling")

An approximate procedure is to reject θ if pθ < α, where the p-value is computed assuming for the nuisance parameter the value that best fits the data for the specified θ (the conditional ML estimator, written with a "double hat": the value of the parameter that maximizes the likelihood for the given θ).

The resulting confidence interval will have the correct coverage at the points where the nuisance parameter equals its conditional estimate for the given θ. Elsewhere it may under- or overcover, but this is usually as good as we can do (check with MC if it is a crucial or small-sample problem).

"Hybrid frequentist-Bayesian" method

Alternatively, suppose the uncertainty in ν is characterized by a Bayesian prior π(ν). We can then use the marginal likelihood to model the data:

    Lm(x|θ) = ∫ L(x|θ, ν) π(ν) dν .

This does not represent what the data distribution would be if we "really" repeated the experiment, since then ν would not change. But the procedure has the desired effect: the marginal likelihood effectively builds the uncertainty due to ν into the model.

We use this now to compute (frequentist) p-values → the model being tested is in effect a weighted average of models.
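As an illustration (not from the slides), the integral over ν can be estimated by averaging the ordinary likelihood over values of ν drawn from its prior. Below is a minimal Python sketch for an assumed toy model, a single measurement x ~ Gauss(θ + ν, σ) with a Gaussian prior π(ν); all names and numbers are hypothetical, and the exact Gaussian result is printed for comparison.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    sigma, sigma_nu = 1.0, 0.5      # hypothetical statistical error and prior width for nu

    def marginal_likelihood(x, theta, n_mc=100_000):
        """Lm(x|theta) = integral of L(x|theta,nu) pi(nu) dnu, estimated by
        averaging the likelihood over nu sampled from its prior pi(nu)."""
        nu = rng.normal(0.0, sigma_nu, size=n_mc)
        return norm.pdf(x, theta + nu, sigma).mean()

    # In this Gaussian toy case the integral is known: Gauss(theta, sqrt(sigma^2 + sigma_nu^2))
    x_obs, theta = 0.8, 0.0
    print(marginal_likelihood(x_obs, theta),
          norm.pdf(x_obs, theta, np.hypot(sigma, sigma_nu)))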

Example of treatment of nuisance parameters: fitting a straight line

Data: (xi, yi, σi), i = 1, ..., n.
Model: the yi are independent and each follows yi ~ Gauss(µ(xi), σi), with the straight-line mean µ(x; θ0, θ1) = θ0 + θ1 x; assume the xi and σi are known.

Goal: estimate θ0. Here suppose we don't care about θ1 (an example of a "nuisance parameter").

Maximum likelihood fit with Gaussian data

In this example the yi are assumed independent, so the likelihood function is a product of Gaussians:

    L(θ0, θ1) = ∏i (1 / √(2π σi²)) exp[ −(yi − µ(xi; θ0, θ1))² / (2σi²) ] .

Maximizing the likelihood is here equivalent to minimizing

    χ²(θ0, θ1) = −2 ln L(θ0, θ1) + const = Σi (yi − µ(xi; θ0, θ1))² / σi² ,

i.e., for Gaussian data, ML is the same as the Method of Least Squares (LS).
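As a minimal sketch of this ML = LS equivalence (not part of the slides), the following Python code fits a straight line µ(x; θ0, θ1) = θ0 + θ1 x by minimizing the χ² above; the data points and errors are hypothetical.

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical data: x values, measured y values and their known sigmas
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
    sigma = np.full_like(y, 0.3)

    def chi2(theta):
        """For Gaussian y_i, minimizing chi2 = -2 ln L + const gives the ML estimate."""
        theta0, theta1 = theta
        mu = theta0 + theta1 * x              # straight-line model mu(x; theta0, theta1)
        return np.sum(((y - mu) / sigma) ** 2)

    res = minimize(chi2, x0=[0.0, 1.0])
    print("theta_hat =", res.x, " chi2_min =", res.fun)   # ML (= LS) estimates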

θ1 known a priori

For Gaussian yi, ML is the same as LS: minimize χ²(θ0) to obtain the estimator θ̂0, then come up one unit from χ²min to find σθ̂0.

ML (or LS) fit of θ0 and θ1

[Figure: χ² contour in the (θ0, θ1) plane.] The standard deviations are obtained from the tangent lines to the contour χ² = χ²min + 1. Correlation between the estimators θ̂0 and θ̂1 causes the errors to increase.

If we have a measurement t1 ~ Gauss(θ1, σt1)

[Figure: χ² contour with the additional constraint on θ1.] The information on θ1 improves the accuracy of θ̂0.

Bayesian method

We need to associate prior probabilities with θ0 and θ1, e.g.,

    π(θ0) = constant ("non-informative", in any case much broader than the likelihood),
    π(θ1) = Gaussian centred on the previous measurement t1.

Putting this into Bayes' theorem gives

    p(θ0, θ1|x) ∝ L(x|θ0, θ1) π(θ0) π(θ1)
    (posterior ∝ likelihood × prior).

Bayesian method (continued)

We then integrate (marginalize) p(θ0, θ1|x) to find p(θ0|x):

    p(θ0|x) = ∫ p(θ0, θ1|x) dθ1 .

In this example we can do the integral in closed form (rare); the resulting p(θ0|x) is again Gaussian. Usually numerical methods (e.g., Markov Chain Monte Carlo) are needed to do the integral.

Digression: marginalization with MCMC

Bayesian computations involve integrals like

    p(θ0|x) = ∫ p(θ0, θ1|x) dθ1 ,

often of high dimensionality and impossible in closed form, and also impossible with "normal" acceptance-rejection Monte Carlo.

Markov Chain Monte Carlo (MCMC) has revolutionized Bayesian computation. MCMC (e.g., the Metropolis-Hastings algorithm) generates a correlated sequence of random numbers: such a sequence cannot be used for many applications, e.g., detector MC, and its effective statistical error is greater than if all the values were independent.

Basic idea: sample the full multidimensional parameter space and then look, e.g., only at the distribution of the parameters of interest.

MCMC basics: Metropolis-Hastings algorithm

Goal: given an n-dimensional pdf p(θ), generate a sequence of points θ1, θ2, θ3, ...

1) Start at some point θ0.
2) Generate a proposed point θ from a proposal density q(θ; θ0), e.g., a Gaussian centred about θ0.
3) Form the Hastings test ratio α = min[1, p(θ) q(θ0; θ) / (p(θ0) q(θ; θ0))].
4) Generate u uniformly in [0, 1].
5) If u ≤ α, move to the proposed point; else stay at the old point (old point repeated).
6) Iterate.

Metropolis-Hastings (continued)

This rule produces a correlated sequence of points (note how each new point depends on the previous one). For our purposes this correlation is not fatal, but the statistical errors are larger than if the points were independent.

The proposal density can be (almost) anything, but it should be chosen so as to minimize the autocorrelation. Often the proposal density is taken symmetric, q(θ; θ0) = q(θ0; θ), and the test ratio becomes (Metropolis-Hastings)

    α = min[1, p(θ)/p(θ0)] .

I.e., if the proposed step is to a point of higher p(θ), take it; if not, only take the step with probability p(θ)/p(θ0). If the proposed step is rejected, hop in place.
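The algorithm above can be written in a few lines. The following Python sketch (not from the slides) implements the symmetric-proposal (Metropolis) case and applies it to an assumed two-dimensional Gaussian "posterior", then marginalizes by simply summarizing one coordinate of the chain; the target, starting point and step size are hypothetical choices.

    import numpy as np

    def metropolis_hastings(log_p, theta_start, n_steps, step_size, rng=None):
        """Sample a target pdf p(theta), given log_p = ln p up to a constant, using a
        symmetric Gaussian proposal, so the Hastings ratio reduces to p(prop)/p(cur)."""
        rng = np.random.default_rng() if rng is None else rng
        theta = np.atleast_1d(np.asarray(theta_start, dtype=float))
        chain = np.empty((n_steps, theta.size))
        logp_cur = log_p(theta)
        for i in range(n_steps):
            proposal = theta + step_size * rng.standard_normal(theta.size)
            logp_prop = log_p(proposal)
            # accept with probability min(1, p(prop)/p(cur)); else repeat the old point
            if np.log(rng.uniform()) < logp_prop - logp_cur:
                theta, logp_cur = proposal, logp_prop
            chain[i] = theta
        return chain

    # Hypothetical 2D correlated Gaussian posterior in (theta0, theta1)
    cov_inv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))
    log_post = lambda t: -0.5 * t @ cov_inv @ t
    chain = metropolis_hastings(log_post, theta_start=[0.0, 0.0],
                                n_steps=20000, step_size=0.5)
    # Marginalize: look only at the parameter of interest, theta0 = chain[:, 0]
    print("theta0 mean, std:", chain[:, 0].mean(), chain[:, 0].std())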

Example: posterior pdf from MCMC

Sample the posterior pdf from the previous example with MCMC. [Figure: MCMC sample of points in the (θ0, θ1) plane and the marginal distribution of θ0.]

Summarize the pdf of the parameter of interest with, e.g., its mean, median, standard deviation, etc. Although the numerical values of the answer here are the same as in the frequentist case, the interpretation is different (sometimes unimportant?).

Bayesian method with alternative priors

Suppose we don't have a previous measurement of θ1 but rather, e.g., a theorist says it should be positive and not too much greater than 0.1 "or so", i.e., something like a prior that is concentrated at positive values and falls off above about 0.1.

From this we obtain (numerically) the posterior pdf for θ0; this summarizes all our knowledge about θ0. One should also look at the result from a variety of priors.

A typical fitting problem

Given measurements yi, i = 1, ..., N, and (usually) their covariances Vij.

Predicted value: µ(xi; θ) + bi, where xi is a control variable, µ is the expectation value, θ are the parameters and bi is a bias.

Often one takes a Gaussian likelihood, L(θ) ∝ exp(−χ²(θ)/2), and minimizes χ²(θ); this is equivalent to maximizing L(θ), i.e., least squares is the same as maximum likelihood using a Gaussian likelihood function.
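A hedged sketch of this kind of correlated least-squares fit (the model, numbers and covariance are hypothetical; the bias terms bi are omitted here for brevity):

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical measurements y at control points x, with covariance V
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.3, 2.9, 4.2])
    V = 0.04 * np.eye(4) + 0.01          # statistical variances plus a common systematic
    V_inv = np.linalg.inv(V)

    def mu(x, theta):
        """Predicted expectation values; a straight line, purely for illustration."""
        return theta[0] + theta[1] * x

    def chi2(theta):
        r = y - mu(x, theta)
        return r @ V_inv @ r             # generalized least squares with full covariance

    res = minimize(chi2, x0=[0.0, 1.0])
    print("theta_hat =", res.x, " chi2_min =", res.fun)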

Its Bayesian equivalent

Take the joint probability for all parameters, L(y|θ, b) π(θ) π(b), and use Bayes' theorem:

    p(θ, b|y) ∝ L(y|θ, b) π(θ) π(b) .

To get the desired probability for θ, integrate (marginalize) over b:

    p(θ|y) = ∫ p(θ, b|y) db .

→ The posterior is Gaussian with mode the same as the least-squares estimator, and σθ the same as from χ² = χ²min + 1. (Back where we started!)

The error on the error

Some systematic errors are well determined, e.g., the error from a finite Monte Carlo sample.

Some are less obvious: do the analysis in n "equally valid" ways and extract the systematic error from the "spread" in the results.

Some are educated guesses: guess the possible size of missing terms in a perturbation series; vary the renormalization scale.

Can we incorporate the "error on the error"? (cf. G. D'Agostini 1999; Dose & von der Linden 1999)

Motivating a non-Gaussian prior πb(b)

Suppose now the experiment is characterized by a reported systematic error that may be mis-stated: si is an (unreported) factor by which the systematic error is over/under-estimated. Assume the correct error for a Gaussian πb(b) would be si σi,sys, so

    πb(bi) = ∫ Gauss(bi; 0, si σi,sys) πs(si) dsi .

The width of πs(si) reflects the "error on the error".

Error-on-error function πs(s)

A simple unimodal probability density for s ≥ 0 with adjustable mean and variance is the Gamma distribution:

    πs(s) = (a^b / Γ(b)) s^(b−1) e^(−as),  with mean b/a and variance b/a² .

We want, e.g., an expectation value of 1 and an adjustable standard deviation σs, i.e., a = b = 1/σs².

In fact if we took πs(s) to be an inverse Gamma distribution, we could integrate πb(b) in closed form (cf. D'Agostini, Dose, von der Linden). But the Gamma distribution seems more natural and the numerical treatment is not too painful.
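As a small numerical illustration (assuming SciPy's parameterization, in which the Gamma shape corresponds to b and the scale to 1/a), the requirement of unit mean and standard deviation σs fixes a = b = 1/σs²:

    import numpy as np
    from scipy.stats import gamma

    sigma_s = 0.5                       # desired "error on the error"
    # Gamma(s; a, b) with mean b/a and variance b/a^2 (the slide's convention):
    # mean = 1 and std dev = sigma_s  =>  a = b = 1 / sigma_s**2
    a = b = 1.0 / sigma_s**2
    pi_s = gamma(b, scale=1.0 / a)      # SciPy: shape = b, scale = 1/a

    print("mean =", pi_s.mean(), " std =", pi_s.std())   # ~1.0 and ~0.5
    print("P(s > 2) =", pi_s.sf(2.0))                    # upper tail probability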

Prior for bias πb(b) now has longer tails

[Figure: πb(b) vs. b for a pure Gaussian (σs = 0) and for σs = 0.5.]

    Gaussian (σs = 0):  P(|b| > 4σsys) = 6.3 × 10⁻⁵
    σs = 0.5:           P(|b| > 4σsys) = 0.65%

A simple test

Suppose the fit effectively averages four measurements. Take σsys = σstat = 0.1, uncorrelated.

Case #1: the data appear compatible. [Figure: the four measurements vs. experiment number, and the posterior p(µ|y) vs. µ.]

Usually we summarize the posterior p(µ|y) with its mode and standard deviation.

Simple test with inconsistent data

Case #2: there is an outlier. [Figure: the four measurements vs. experiment number, and the posterior p(µ|y) vs. µ.]

→ The Bayesian fit is less sensitive to the outlier.
→ The error is now connected to the goodness-of-fit.

Goodness-of-fit vs. size of error

In an LS fit, the value of the minimized χ² does not affect the size of the error on the fitted parameter. In a Bayesian analysis with a non-Gaussian prior for the systematics, a high χ² corresponds to a larger error (and vice versa).

[Figure: posterior σµ vs. χ² for 2000 repetitions of the experiment with σs = 0.5 and no actual bias, compared with σµ from least squares.]

Is this workable in practice?

We should generalize to include correlations, with a prior on the correlation coefficients π(ρ) (myth: ρ = 1 is "conservative").

We can separate out different systematics for the same measurement; some will have a small σs, others larger.

Remember the "if-then" nature of a Bayesian result: we can (and should) vary the priors and see what effect this has on the conclusions.

Bayesian model selection ("discovery")

The probability of hypothesis H0 (e.g., no Higgs) relative to its complementary alternative H1 (e.g., Higgs) is often given by the posterior odds:

    P(H0|x) / P(H1|x) = B01 × P(H0) / P(H1)
    (posterior odds = Bayes factor B01 × prior odds).

The Bayes factor is regarded as measuring the weight of evidence of the data in support of H0 over H1. One uses B10 = 1/B01 interchangeably.

Assessing Bayes factors

One can use the Bayes factor much like a p-value (or Z value). The Jeffreys scale, analogous to HEP's 5σ rule:

    B10          Evidence against H0
    -----------  ------------------------------------
    1 to 3       Not worth more than a bare mention
    3 to 20      Positive
    20 to 150    Strong
    > 150        Very strong

Kass and Raftery, Bayes Factors, J. Am. Stat. Assoc. 90 (1995) 773.

Rewriting the Bayes factor

Suppose we have models Hi, i = 0, 1, ..., each with a likelihood P(x|θi, Hi) and a prior pdf for its internal parameters π(θi|Hi), so that the full prior is

    π(θi, Hi) = π(θi|Hi) P(Hi) ,

where P(Hi) is the overall prior probability for Hi. The Bayes factor comparing Hi and Hj can be written as the ratio of posterior to prior odds,

    Bij = [ P(Hi|x) / P(Hj|x) ] / [ P(Hi) / P(Hj) ] .

Bayes factors independent of P(Hi)

For Bij we need the posterior probabilities marginalized over all of the internal parameters of the models; using Bayes' theorem,

    P(Hi|x) = [ ∫ P(x|θi, Hi) π(θi|Hi) dθi ] P(Hi) / P(x) .

So the Bayes factor is the ratio of marginal likelihoods,

    Bij = ∫ P(x|θi, Hi) π(θi|Hi) dθi / ∫ P(x|θj, Hj) π(θj|Hj) dθj ;

the prior probabilities P(Hi) cancel.

Numerical determination of Bayes factors

Both the numerator and denominator of Bij are of the form

    m = ∫ L(x|θ) π(θ) dθ   ("marginal likelihood").

There are various ways to compute these, e.g., using sampling of the posterior pdf (which we can do with MCMC):
- harmonic mean (and improvements);
- importance sampling;
- parallel tempering (≈ thermodynamic integration);
- nested sampling (MultiNest), ...

Priors for Bayes factors

Note that for Bayes factors (unlike Bayesian limits), the prior cannot be improper. If it is, the posterior is only defined up to an arbitrary constant, and so the Bayes factor is ill-defined.

A possible exception is allowed if both models contain the same improper prior; but having the same parameter name (or Greek letter) in both models does not fully justify this step. If an improper prior is made proper, e.g., by a cut-off, the Bayes factor will retain a dependence on this cut-off.

In general, for Bayes factors all priors must reflect "meaningful" degrees of uncertainty about the parameters.

Harmonic mean estimator

E.g., consider only one model and write Bayes' theorem as

    p(θ|x) = L(x|θ) π(θ) / m,  with  m = ∫ L(x|θ) π(θ) dθ .

Since π(θ) is normalized to unity, rearranging and integrating both sides gives

    ∫ [1 / L(x|θ)] p(θ|x) dθ = 1/m ,

i.e., 1/m is the posterior expectation of 1/L. Therefore sample θ from the posterior via MCMC and estimate m with one over the average of 1/L (the harmonic mean of L).
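A minimal Python sketch of the harmonic mean estimator for an assumed conjugate toy model (single observation x ~ Gauss(θ, σ) with prior θ ~ Gauss(0, τ)), where the posterior can be sampled directly and the exact marginal likelihood is known for comparison; all numbers are hypothetical:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    x_obs, sigma, tau = 1.5, 1.0, 2.0

    # Exact marginal likelihood for this toy model: x ~ Gauss(0, sqrt(sigma^2 + tau^2))
    m_exact = norm.pdf(x_obs, 0.0, np.hypot(sigma, tau))

    # Conjugate case: the posterior for theta is Gaussian, so sample it directly
    post_var = 1.0 / (1.0 / sigma**2 + 1.0 / tau**2)
    post_mean = post_var * x_obs / sigma**2
    theta = rng.normal(post_mean, np.sqrt(post_var), size=200_000)

    # Harmonic mean estimator: 1/m = posterior average of 1/L(x|theta)
    L = norm.pdf(x_obs, theta, sigma)
    m_hm = 1.0 / np.mean(1.0 / L)
    print("exact:", m_exact, " harmonic mean estimate:", m_hm)   # unstable in general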

Improvements to harmonic mean estimator

The harmonic mean estimator is numerically very unstable; formally it has infinite variance (!). Gelfand & Dey propose a variant: rearrange Bayes' theorem, multiply both sides by an arbitrary normalized pdf f(θ), and integrate over θ to obtain

    ∫ [ f(θ) / ( L(x|θ) π(θ) ) ] p(θ|x) dθ = 1/m ,

so 1/m is now the posterior expectation of f(θ)/[L(x|θ)π(θ)]. Convergence is improved if the tails of f(θ) fall off faster than L(x|θ)π(θ). Note the harmonic mean estimator is the special case f(θ) = π(θ).

Importance sampling

We need a pdf f(θ) that we can evaluate at arbitrary θ and also sample with MC. The marginal likelihood can be written

    m = ∫ L(x|θ) π(θ) dθ = ∫ [ L(x|θ) π(θ) / f(θ) ] f(θ) dθ ,

i.e., m is the expectation of L(x|θ)π(θ)/f(θ) under f. The best convergence is obtained when f(θ) approximates the shape of L(x|θ)π(θ). Use for f(θ), e.g., a multivariate Gaussian with mean and covariance estimated from the posterior (e.g., with MINUIT).
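The corresponding importance-sampling sketch for the same assumed toy model as above, with a Gaussian importance density f(θ) chosen roughly to match the posterior; all numbers are hypothetical:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    x_obs, sigma, tau = 1.5, 1.0, 2.0
    m_exact = norm.pdf(x_obs, 0.0, np.hypot(sigma, tau))

    # Importance density f(theta): a Gaussian roughly matched to the posterior
    f_mean, f_std = 1.2, 1.0
    theta = rng.normal(f_mean, f_std, size=100_000)

    # m = E_f[ L(x|theta) * pi(theta) / f(theta) ]
    w = (norm.pdf(x_obs, theta, sigma) * norm.pdf(theta, 0.0, tau)
         / norm.pdf(theta, f_mean, f_std))
    print("exact:", m_exact, " importance-sampling estimate:", w.mean())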

[Slide with material from K. Cranmer / R. Trotta, PHYSTAT 2011.]

Extra slides

The Look-Elsewhere Effect

Gross and Vitells, EPJC 70:525-530, 2010; arXiv:1005.1891.

Suppose a model for a mass distribution allows for a peak at a mass m with amplitude µ. The data show a bump at a mass m0. How consistent is this with the no-bump (µ = 0) hypothesis?

Local p-value

First, suppose the mass m0 of the peak was specified a priori. Test the consistency of the bump with the no-signal (µ = 0) hypothesis with, e.g., the likelihood ratio

    tfix = −2 ln [ L(µ = 0, m0) / L(µ̂, m0) ] ,

where "fix" indicates that the mass of the peak is fixed to m0. The resulting p-value gives the probability to find a value of tfix at least as great as the one observed at the specific mass m0 and is called the local p-value.

Global p-value

But suppose we did not know where in the distribution to expect a peak. What we want is the probability to find a peak at least as significant as the one observed anywhere in the distribution.

Include the mass as an adjustable parameter in the fit and test the significance of the peak using

    tfloat = −2 ln [ L(µ = 0) / L(µ̂, m̂) ]

(note m does not appear in the µ = 0 model).

Distributions of tfix, tfloat (Gross and Vitells)

For a sufficiently large data sample, tfix ~ chi-square for 1 degree of freedom (Wilks' theorem). For tfloat there are two adjustable parameters, µ and m, and naively Wilks' theorem says tfloat ~ chi-square for 2 d.o.f.

In fact Wilks' theorem does not hold in the floating-mass case because one of the parameters (m) is not defined in the µ = 0 model, so getting the tfloat distribution is more difficult.

Approximate correction for LEE (Gross and Vitells)

We would like to be able to relate the p-values for the fixed- and floating-mass analyses (at least approximately). Gross and Vitells show the p-values are approximately related by

    pglobal ≈ plocal + ⟨N(c)⟩ ,

where ⟨N(c)⟩ is the mean number of "upcrossings" of tfix = −2 ln λ in the fit range based on a threshold c (the observed value of tfix), and where Zlocal = Φ⁻¹(1 − plocal) is the local significance.

So we can either carry out the full floating-mass analysis (e.g., use MC to get the p-value), or do the fixed-mass analysis and apply a correction factor (much faster than MC).

Upcrossings of −2 ln λ (Gross and Vitells)

The Gross-Vitells formula for the trials factor requires ⟨N(c)⟩, the mean number of "upcrossings" of tfix = −2 ln λ above a threshold c = tfix,obs, found when varying the mass m0 over the range considered.

⟨N(c)⟩ can be estimated from MC (or the real data) using a much lower threshold c0:

    ⟨N(c)⟩ ≈ ⟨N(c0)⟩ exp[ −(c − c0)/2 ] .

In this way ⟨N(c)⟩ can be estimated without the need for large MC samples, even if the threshold c is quite high.
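A hedged numerical sketch of how the correction might be applied in the one-parameter (1 d.o.f.) case, taking the one-sided convention c = Zlocal²; the upcrossing count ⟨N(c0)⟩, the threshold c0 and the local significance are hypothetical inputs:

    import numpy as np
    from scipy.stats import norm

    def global_p_value(Z_local, N_c0, c0):
        """Approximate LEE correction (1D scan, 1 d.o.f. case, assumed form):
        p_global ~ p_local + <N(c)>, with <N(c)> = <N(c0)> * exp(-(c - c0)/2)
        and c = Z_local**2 (one-sided convention)."""
        p_local = norm.sf(Z_local)                  # one-sided local p-value
        c = Z_local**2
        N_c = N_c0 * np.exp(-0.5 * (c - c0))
        return p_local + N_c

    # Hypothetical: 4 upcrossings seen above c0 = 0.5 in the mass scan, Z_local = 4
    p_glob = global_p_value(Z_local=4.0, N_c0=4.0, c0=0.5)
    print("p_global =", p_glob, " Z_global =", norm.isf(p_glob))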

Multidimensional look-elsewhere effect

Vitells and Gross, Astropart. Phys. 35 (2011) 230-234; arXiv:1105.4355.

Generalization to multiple dimensions: the number of upcrossings is replaced by the expectation of the Euler characteristic of the excursion set above the threshold.

Applications: astrophysics (coordinates on the sky), searches for a resonance of unknown mass and width, ...

Summary on Look-Elsewhere Effect

Remember that the Look-Elsewhere Effect arises when we test a single model (e.g., the SM) with multiple observations, i.e., in multiple places.

Note there is no look-elsewhere effect when considering exclusion limits. There we test specific signal models (typically once) and say whether each is excluded. With exclusion there is, however, the analogous issue of testing many signal models (or parameter values) and thus excluding some even in the absence of signal ("spurious exclusion").

An approximate correction for the LEE should be sufficient, and one should also report the uncorrected significance.

"There's no sense in being precise when you don't even know what you're talking about." -- John von Neumann

Why 5 sigma?

Common practice in HEP has been to claim a discovery if the p-value of the no-signal hypothesis is below 2.9 × 10⁻⁷, corresponding to a significance Z = Φ⁻¹(1 − p) = 5 (a 5σ effect; a short numerical check of this correspondence follows this slide).

There are a number of reasons why one may want to require such a high threshold for discovery:
- the "cost" of announcing a false discovery is high;
- unsure about systematics;
- unsure about the look-elsewhere effect;
- the implied signal may be a priori highly improbable (e.g., violation of Lorentz invariance).
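A two-line check of the quoted correspondence, using SciPy's normal distribution:

    from scipy.stats import norm
    print(norm.sf(5.0))        # ~2.87e-07: one-sided p-value for a 5 sigma effect
    print(norm.isf(2.9e-7))    # ~5.0: significance Z = Phi^-1(1 - p)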

Why 5 sigma (cont.)?

But the primary role of the p-value is to quantify the probability that the background-only model gives a statistical fluctuation as big as the one seen or bigger. It is not intended as a means to protect against hidden systematics or as the high standard required for a claim of an important discovery.

In the process of establishing a discovery there comes a point where it is clear that the observation is not simply a fluctuation but an "effect", and the focus shifts to whether this is new physics or a systematic. Provided the LEE is dealt with, that threshold is probably closer to 3σ than 5σ.

Jackknife, bootstrap, etc.

To estimate a parameter we have various tools such as maximum likelihood, least squares, etc. Usually one also needs to know the variance (or the full sampling distribution) of the estimator, and this can be more difficult.

Often we use asymptotic properties, e.g., the sampling distribution of ML estimators becomes Gaussian in the large-sample limit, with the standard deviation obtained from the curvature of the log-likelihood at its maximum.

The jackknife and bootstrap are examples of "resampling" methods used to estimate the sampling distribution of statistics. In HEP we often do this implicitly by using Toy MC to determine the sampling properties of statistics (e.g., the Brazil plot for the 1σ, 2σ bands of limits).

The Jackknife

Invented by Quenouille (1949) and Tukey (1958). Suppose the data sample consists of n events, x = (x1, ..., xn), and we have an estimator θ̂(x) for a parameter θ.

The idea is to produce pseudo data samples x−i = (x1, ..., xi−1, xi+1, ..., xn) by leaving out the ith event. Let θ̂−i be the estimator obtained from the data sample x−i.

Suppose the estimator has a nonzero bias, b = E[θ̂] − θ. The jackknife estimator of the bias is

    b̂jack = (n − 1) [ (1/n) Σi θ̂−i − θ̂ ] .

See, e.g., Notes on Jackknife and Bootstrap by G. J. Babu (JackknifeBootstrap notes.pdf).
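A minimal Python sketch of the jackknife bias estimate (the data and the choice of estimator are hypothetical); applied here to the ML variance estimator, whose bias of about −σ²/n it should recover:

    import numpy as np

    def jackknife_bias(estimator, x):
        """Jackknife bias estimate: (n-1) * (mean over i of theta_hat_{-i} - theta_hat)."""
        x = np.asarray(x)
        n = len(x)
        theta_hat = estimator(x)
        theta_minus_i = np.array([estimator(np.delete(x, i)) for i in range(n)])
        return (n - 1) * (theta_minus_i.mean() - theta_hat)

    # Example: np.var (with default ddof=0) is the ML variance estimator, biased by ~ -sigma^2/n
    rng = np.random.default_rng(3)
    x = rng.normal(0.0, 1.0, size=50)
    print("jackknife bias estimate:", jackknife_bias(np.var, x), " expected ~", -1.0 / 50)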

The Bootstrap (Efron, 1979)

The idea is to produce a set of "bootstrapped" data samples of the same size as the original (real) one by sampling from some distribution that approximates the true (unknown) one. By evaluating a statistic (such as an estimator for a parameter θ) with the bootstrapped samples, properties of its sampling distribution (often its variance) can be estimated.

If the data consist of n events, one way to produce the bootstrapped samples is to randomly select n events from the original sample with replacement (the non-parametric bootstrap). That is, some events might get used multiple times and others might not get used at all.

In other cases one could generate the bootstrapped samples from a parametric MC model, using parameter values estimated from the real data in the MC (the parametric bootstrap).
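A minimal sketch of the non-parametric bootstrap for estimating the standard deviation of a statistic (the sample, the statistic and the number of replicas are hypothetical):

    import numpy as np

    def bootstrap_std(estimator, x, n_boot=2000, rng=None):
        """Non-parametric bootstrap: resample x with replacement and take the spread
        of the estimator over the bootstrapped samples as its standard deviation."""
        rng = np.random.default_rng() if rng is None else rng
        x = np.asarray(x)
        n = len(x)
        boot = np.array([estimator(x[rng.integers(0, n, size=n)]) for _ in range(n_boot)])
        return boot.std(ddof=1)

    # Example: standard error of the sample median
    rng = np.random.default_rng(4)
    x = rng.exponential(1.0, size=200)
    print("bootstrap std. error of the median:", bootstrap_std(np.median, x, rng=rng))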

The Bootstrap (cont.)

Call the data sample x = (x1, ..., xn), the observed data xobs, and the bootstrapped samples x1*, x2*, ...

The idea is to use the distribution of

    θ̂(x*) − θ̂(xobs)

as an approximation for the distribution of

    θ̂(x) − θ .

In the first quantity everything is known from the observed data plus the bootstrapped samples, so we can use its distribution to estimate the bias, variance, etc. of the estimator θ̂.
