
Basics of Probability and Probability Distributions
Piyush Rai (IITK)

Some Basic Concepts You Should Know About
- Random variables (discrete and continuous)
- Probability distributions over discrete/continuous r.v.'s
- Notions of joint, marginal, and conditional probability distributions
- Properties of random variables (and of functions of random variables)
- Expectation and variance/covariance of random variables
- Examples of probability distributions and their properties
- Multivariate Gaussian distribution and its properties (very important)

Note: These slides provide only a (very!) quick review of these things. Please refer to a text such as PRML (Bishop) Chapter 2 and Appendix B, or MLAPP (Murphy) Chapter 2 for more details.
Note: Some other pre-requisites (e.g., concepts from information theory, linear algebra, optimization, etc.) will be introduced as and when they are required.

Random Variables
- Informally, a random variable (r.v.) X denotes possible outcomes of an event
- Can be discrete (i.e., finitely many possible outcomes) or continuous
- Some examples of discrete r.v.'s:
  - A random variable X ∈ {0, 1} denoting the outcome of a coin toss
  - A random variable X ∈ {1, 2, ..., 6} denoting the outcome of a dice roll
- Some examples of continuous r.v.'s:
  - A random variable X ∈ (0, 1) denoting the bias of a coin
  - A random variable X denoting heights of students in this class
  - A random variable X denoting time to get to your hall from the department

Discrete Random Variables
- For a discrete r.v. X, p(x) denotes the probability that X = x, i.e., p(x) = p(X = x)
- p(x) is called the probability mass function (PMF)
- The PMF satisfies: p(x) ≥ 0, p(x) ≤ 1, and ∑_x p(x) = 1
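These PMF properties can be checked directly in code. A minimal sketch, using a fair six-sided die as the (hypothetical) example:

```python
# PMF of a fair six-sided die: p(x) = 1/6 for x in {1, ..., 6}
pmf = {x: 1 / 6 for x in range(1, 7)}

# p(x) >= 0, p(x) <= 1, and sum_x p(x) = 1
assert all(0 <= p <= 1 for p in pmf.values())
assert abs(sum(pmf.values()) - 1.0) < 1e-12
```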

Continuous Random Variables
- For a continuous r.v. X, a probability p(X = x) is meaningless
- Instead we use p(X = x) or p(x) to denote the probability density at X = x
- For a continuous r.v. X, we can only talk about probability within an interval X ∈ (x, x + δx)
- p(x)δx is the probability that X ∈ (x, x + δx) as δx → 0
- The probability density p(x) satisfies the following: p(x) ≥ 0 and ∫_x p(x) dx = 1 (note: for a continuous r.v., p(x) can be > 1)
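The point that a density can exceed 1 while still integrating to 1 is easy to verify numerically. A sketch using Uniform(0, 0.5), whose density is 2 on its support (the interval is chosen purely for illustration):

```python
# Density of Uniform(0, 0.5): equals 2 (> 1) on its support, 0 elsewhere
def pdf(x):
    return 2.0 if 0.0 <= x <= 0.5 else 0.0

# Riemann sum approximating the integral of p(x) over [0, 2]
dx = 1e-4
total = sum(pdf(i * dx) * dx for i in range(20_000))
assert abs(total - 1.0) < 1e-2   # the density still integrates to 1
```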

A word about notation
- p(·) can mean different things depending on the context
- p(X) denotes the distribution (PMF/PDF) of an r.v. X
- p(X = x) or p(x) denotes the probability or probability density at point x
- Actual meaning should be clear from the context (but be careful)
- Exercise the same care when p(·) is a specific distribution (Bernoulli, Beta, Gaussian, etc.)
- The following means drawing a random sample from the distribution p(X): x ∼ p(X)

Joint Probability Distribution
- The joint probability distribution p(X, Y) models the probability of co-occurrence of two r.v.'s X, Y
- For discrete r.v.'s, the joint PMF p(X, Y) is like a table (that sums to 1):
  ∑_x ∑_y p(X = x, Y = y) = 1
- For continuous r.v.'s, we have the joint PDF p(X, Y):
  ∫_x ∫_y p(X = x, Y = y) dx dy = 1

Marginal Probability Distribution
- Intuitively, the probability distribution of one r.v. regardless of the value the other r.v. takes
- For discrete r.v.'s: p(X) = ∑_y p(X, Y = y), p(Y) = ∑_x p(X = x, Y)
- For discrete r.v.'s it is the sum of the PMF table along the rows/columns
- For continuous r.v.'s: p(X) = ∫_y p(X, Y = y) dy, p(Y) = ∫_x p(X = x, Y) dx
- Note: Marginalization is also called "integrating out"
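A sketch of marginalizing a small joint PMF table by summing over the other variable (the table entries are made up for illustration):

```python
# A joint PMF over binary X, Y (hypothetical values; entries sum to 1)
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

# Marginalize: p(X = x) = sum_y p(X = x, Y = y)
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

assert abs(p_x[0] - 0.4) < 1e-9 and abs(p_x[1] - 0.6) < 1e-9
```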

Conditional Probability Distribution
- Probability distribution of one r.v. given the value of the other r.v.
- Conditional probability p(X | Y = y) or p(Y | X = x): like taking a slice of p(X, Y)
- For a discrete distribution: p(X = x | Y = y) = p(X = x, Y = y) / p(Y = y) [figure]
- For a continuous distribution, the same formula holds with densities [figure]¹

¹ Picture courtesy: Computer Vision: Models, Learning, and Inference (Simon Prince)

Some Basic Rules
- Sum rule: gives the marginal probability distribution from the joint probability distribution
  - For discrete r.v.'s: p(X) = ∑_Y p(X, Y)
  - For continuous r.v.'s: p(X) = ∫_Y p(X, Y) dY
- Product rule: p(X, Y) = p(Y | X) p(X) = p(X | Y) p(Y)
- Bayes rule: gives the conditional probability
  p(Y | X) = p(X | Y) p(Y) / p(X)
  - For discrete r.v.'s: p(Y | X) = p(X | Y) p(Y) / ∑_Y p(X | Y) p(Y)
  - For continuous r.v.'s: p(Y | X) = p(X | Y) p(Y) / ∫_Y p(X | Y) p(Y) dY
- Also remember the chain rule:
  p(X_1, X_2, ..., X_N) = p(X_1) p(X_2 | X_1) ... p(X_N | X_1, ..., X_{N−1})
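A sketch of the sum and Bayes rules on a standard diagnostic-test calculation (all numbers are hypothetical):

```python
# Prior p(Y = disease) and likelihoods p(X = positive | Y) -- hypothetical numbers
p_d = 0.01
p_pos_given_d = 0.90
p_pos_given_healthy = 0.05

# Sum rule (the denominator of Bayes rule): p(X = positive)
p_pos = p_pos_given_d * p_d + p_pos_given_healthy * (1 - p_d)

# Bayes rule: p(Y = disease | X = positive)
p_d_given_pos = p_pos_given_d * p_d / p_pos
assert 0.15 < p_d_given_pos < 0.16   # about 0.154, despite the 90%-sensitive test
```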

Independence
- X and Y are independent (X ⊥ Y) when knowing one tells nothing about the other:
  - p(X | Y = y) = p(X)
  - p(Y | X = x) = p(Y)
  - p(X, Y) = p(X) p(Y)
- X ⊥ Y is also called marginal independence
- Conditional independence (X ⊥ Y | Z): independence given the value of another r.v. Z:
  p(X, Y | Z = z) = p(X | Z = z) p(Y | Z = z)
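For two fair dice the joint factorizes into the marginals everywhere; a quick numerical check of p(X, Y) = p(X) p(Y):

```python
from itertools import product

# Joint PMF of two independent fair dice: uniform over the 36 outcomes
joint = {(x, y): 1 / 36 for x, y in product(range(1, 7), repeat=2)}

# Marginals, obtained by summing out the other variable
p_x = {x: sum(joint[(x, y)] for y in range(1, 7)) for x in range(1, 7)}
p_y = {y: sum(joint[(x, y)] for x in range(1, 7)) for y in range(1, 7)}

# Independence: the joint equals the product of the marginals at every point
assert all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-12 for (x, y) in joint)
```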

Expectation
- Expectation or mean μ of an r.v. with PMF/PDF p(X):
  E[X] = ∑_x x p(x)  (for discrete distributions)
  E[X] = ∫_x x p(x) dx  (for continuous distributions)
- Note: The definition applies to functions of r.v.'s too (e.g., E[f(X)])
- Linearity of expectation:
  E[αf(X) + βg(Y)] = αE[f(X)] + βE[g(Y)]
  (a very useful property, true even if X and Y are not independent)
- Note: Expectations are always w.r.t. the underlying probability distribution of the random variable involved, so sometimes we'll write this explicitly as E_{p(·)}[·], unless it is clear from the context
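Linearity is worth seeing on deliberately *dependent* variables, since the identity needs no independence. A sketch on a small, correlated joint PMF (values hypothetical):

```python
# A joint PMF over two dependent binary r.v.'s (hypothetical, correlated values)
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def E(f):
    """Expectation of f(X, Y) under the joint PMF."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

alpha, beta = 2.0, -3.0
lhs = E(lambda x, y: alpha * x + beta * y)
rhs = alpha * E(lambda x, y: x) + beta * E(lambda x, y: y)
assert abs(lhs - rhs) < 1e-12   # linearity holds despite the dependence
```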

Variance and Covariance
- Variance σ² (or "spread" around the mean μ) of an r.v. with PMF/PDF p(X):
  var[X] = E[(X − μ)²] = E[X²] − μ²
- Standard deviation: std[X] = √(var[X]) = σ
- For two scalar r.v.'s x and y, the covariance is defined by:
  cov[x, y] = E[{x − E[x]}{y − E[y]}] = E[xy] − E[x]E[y]
- For vector r.v.'s x and y, the covariance matrix is defined as:
  cov[x, y] = E[{x − E[x]}{y^T − E[y^T]}] = E[xy^T] − E[x]E[y^T]
- Covariance of the components of a vector r.v. x: cov[x] = cov[x, x]
- Note: The definitions apply to functions of r.v.'s too (e.g., var[f(X)])
- Note: Variance of a sum of independent r.v.'s: var[X + Y] = var[X] + var[Y]
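A sketch checking var[X] = E[X²] − μ² and the additivity of variance for independent r.v.'s, using two fair dice (for one die, var[X] = 35/12):

```python
from itertools import product

# Joint distribution of two independent fair dice (uniform over 36 outcomes)
outcomes = list(product(range(1, 7), repeat=2))
E = lambda f: sum(f(x, y) for x, y in outcomes) / len(outcomes)

# var[X] = E[X^2] - mu^2 for one die
var_x = E(lambda x, y: x * x) - E(lambda x, y: x) ** 2
assert abs(var_x - 35 / 12) < 1e-9

# For independent r.v.'s, variances add: var[X + Y] = var[X] + var[Y]
var_sum = E(lambda x, y: (x + y) ** 2) - E(lambda x, y: x + y) ** 2
assert abs(var_sum - 2 * var_x) < 1e-9
```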

Transformation of Random Variables
- Suppose y = f(x) = Ax + b is a linear function of an r.v. x
- Suppose E[x] = μ and cov[x] = Σ
- Expectation of y: E[y] = E[Ax + b] = Aμ + b
- Covariance of y: cov[y] = cov[Ax + b] = AΣA^T
- Likewise, if y = f(x) = a^T x + b is a scalar-valued linear function of an r.v. x:
  E[y] = E[a^T x + b] = a^T μ + b
  var[y] = var[a^T x + b] = a^T Σa
- Another very useful property worth remembering
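A Monte Carlo sketch of E[Ax + b] = Aμ + b and cov[Ax + b] = AΣA^T. The identities hold for any distribution of x; the sketch simply samples x from a Gaussian for convenience (μ, Σ, A, b are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 1.0],
              [0.0, 2.0]])
b = np.array([3.0, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + b                        # y = Ax + b, applied row-wise

assert np.allclose(y.mean(axis=0), A @ mu + b, atol=0.05)    # E[y] = A mu + b
assert np.allclose(np.cov(y.T), A @ Sigma @ A.T, atol=0.1)   # cov[y] = A Sigma A^T
```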

Common Probability Distributions
- Important: We will use these extensively to model data as well as parameters
- Some discrete distributions and what they can model:
  - Bernoulli: binary numbers, e.g., outcome (head/tail, 0/1) of a coin toss
  - Binomial: bounded non-negative integers, e.g., # of heads in n coin tosses
  - Multinomial: one of K (> 2) possibilities, e.g., outcome of a dice roll
  - Poisson: non-negative integers, e.g., # of words in a document
  - ... and many others
- Some continuous distributions and what they can model:
  - Uniform: numbers defined over a fixed range
  - Beta: numbers between 0 and 1, e.g., probability of head for a biased coin
  - Gamma: positive unbounded real numbers
  - Dirichlet: vectors that sum to 1 (fraction of data points in different clusters)
  - Gaussian: real-valued numbers or real-valued vectors
  - ... and many others

Discrete Distributions

Bernoulli Distribution
- Distribution over a binary r.v. x ∈ {0, 1}, like a coin-toss outcome
- Defined by a probability parameter p ∈ (0, 1): P(x = 1) = p
- Distribution defined as: Bernoulli(x; p) = p^x (1 − p)^{1−x}
- Mean: E[x] = p
- Variance: var[x] = p(1 − p)
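A Monte Carlo sketch of the Bernoulli mean and variance (the value of p is chosen arbitrarily):

```python
import random

random.seed(0)
p = 0.3
# Sample x = 1 with probability p, else 0
samples = [1 if random.random() < p else 0 for _ in range(100_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
assert abs(mean - p) < 0.01              # E[x] = p
assert abs(var - p * (1 - p)) < 0.01     # var[x] = p(1 - p)
```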

Binomial Distribution
- Distribution over the number of successes m (an r.v.) in a number of trials
- Defined by two parameters: the total number of trials N and the probability of each success p ∈ (0, 1)
- Can think of the Binomial as multiple independent Bernoulli trials
- Distribution defined as: Binomial(m; N, p) = (N choose m) p^m (1 − p)^{N−m}
- Mean: E[m] = Np
- Variance: var[m] = Np(1 − p)
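The PMF can be written straight from the formula (`math.comb` gives "N choose m"); a sketch with arbitrary N and p, checking that it sums to 1 and has mean Np:

```python
import math

def binom_pmf(m, N, p):
    # (N choose m) p^m (1 - p)^(N - m)
    return math.comb(N, m) * p**m * (1 - p) ** (N - m)

N, p = 10, 0.4
probs = [binom_pmf(m, N, p) for m in range(N + 1)]
assert abs(sum(probs) - 1.0) < 1e-9                                   # sums to 1
assert abs(sum(m * q for m, q in zip(range(N + 1), probs)) - N * p) < 1e-9  # mean Np
```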

Multinoulli Distribution
- Also known as the categorical distribution (models categorical variables)
- Think of a random assignment of an item to one of K bins: a K-dim. binary r.v. x with a single 1 (i.e., ∑_{k=1}^K x_k = 1), e.g., x = [0 0 ... 0 1 0 ... 0] (length K). Modeled by a multinoulli
- Let the vector p = [p_1, p_2, ..., p_K] define the probability of going to each bin
  - p_k ∈ (0, 1) is the probability that x_k = 1 (assigned to bin k)
  - ∑_{k=1}^K p_k = 1
- The multinoulli is defined as: Multinoulli(x; p) = ∏_{k=1}^K p_k^{x_k}
- Mean: E[x_k] = p_k
- Variance: var[x_k] = p_k(1 − p_k)

Multinomial Distribution
- Think of repeating the Multinoulli N times
- Like distributing N items to K bins. Suppose x_k is the count in bin k:
  0 ≤ x_k ≤ N for k = 1, ..., K, and ∑_{k=1}^K x_k = N
- Assume the probability of going to each bin is p = [p_1, p_2, ..., p_K]
- The Multinomial models the bin allocations via a discrete vector x of size K:
  x = [x_1 x_2 ... x_{k−1} x_k x_{k+1} ... x_K]
- Distribution defined as: Multinomial(x; N, p) = (N! / (x_1! x_2! ... x_K!)) ∏_{k=1}^K p_k^{x_k}
- Mean: E[x_k] = Np_k
- Variance: var[x_k] = Np_k(1 − p_k)
- Note: For N = 1, the multinomial is the same as the multinoulli
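A sketch of the Multinomial PMF from the formula, enumerating the whole support for small N and K = 3 (p is arbitrary) to check it sums to 1:

```python
import math
from itertools import product

def multinomial_pmf(x, N, p):
    # N! / (x_1! ... x_K!) * prod_k p_k^{x_k}
    coef = math.factorial(N)
    for xk in x:
        coef //= math.factorial(xk)
    out = float(coef)
    for xk, pk in zip(x, p):
        out *= pk**xk
    return out

N, p = 3, [0.2, 0.3, 0.5]
# Support: all count vectors of length 3 summing to N
support = [x for x in product(range(N + 1), repeat=3) if sum(x) == N]
assert abs(sum(multinomial_pmf(x, N, p) for x in support) - 1.0) < 1e-9
```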

Poisson Distribution
- Used to model a non-negative integer (count) r.v. k
- Examples: number of words in a document, number of events in a fixed interval of time, etc.
- Defined by a positive rate parameter λ
- Distribution defined as: Poisson(k; λ) = λ^k e^{−λ} / k!,  k = 0, 1, 2, ...
- Mean: E[k] = λ
- Variance: var[k] = λ
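A sketch of the Poisson PMF from the formula; truncating the infinite sum well past λ is enough here to check normalization and the mean (λ is arbitrary):

```python
import math

def poisson_pmf(k, lam):
    # lambda^k e^{-lambda} / k!
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 4.0
total = sum(poisson_pmf(k, lam) for k in range(100))       # truncated sum
mean = sum(k * poisson_pmf(k, lam) for k in range(100))
assert abs(total - 1.0) < 1e-9    # PMF sums to 1
assert abs(mean - lam) < 1e-9     # E[k] = lambda
```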

Continuous Distributions

Uniform Distribution
- Models a continuous r.v. x distributed uniformly over a finite interval [a, b]:
  Uniform(x; a, b) = 1 / (b − a)
- Mean: E[x] = (a + b) / 2
- Variance: var[x] = (b − a)² / 12

Beta Distribution
- Used to model an r.v. p between 0 and 1 (e.g., a probability)
- Defined by two shape parameters α and β:
  Beta(p; α, β) = (Γ(α + β) / (Γ(α)Γ(β))) p^{α−1} (1 − p)^{β−1}
- Mean: E[p] = α / (α + β)
- Variance: var[p] = αβ / ((α + β)² (α + β + 1))
- Often used to model the probability parameter of a Bernoulli or Binomial (also conjugate to these distributions)
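A sketch of the Beta density from the formula (using `math.gamma` for Γ); a midpoint Riemann sum over (0, 1) approximates 1, and the numerical mean matches α/(α + β) (the shape values are arbitrary):

```python
import math

def beta_pdf(x, a, b):
    # Gamma(a + b) / (Gamma(a) Gamma(b)) * x^{a-1} (1 - x)^{b-1}
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * x ** (a - 1) * (1 - x) ** (b - 1)

a, b, n = 2.0, 3.0, 100_000
xs = [(i + 0.5) / n for i in range(n)]          # midpoints of a fine grid
vals = [beta_pdf(x, a, b) for x in xs]
total = sum(vals) / n
mean = sum(x * v for x, v in zip(xs, vals)) / n
assert abs(total - 1.0) < 1e-3                  # density integrates to 1
assert abs(mean - a / (a + b)) < 1e-3           # E[p] = alpha / (alpha + beta)
```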

Gamma Distribution
- Used to model a positive real-valued r.v. x
- Defined by a shape parameter k and a scale parameter θ:
  Gamma(x; k, θ) = x^{k−1} e^{−x/θ} / (θ^k Γ(k))
- Mean: E[x] = kθ
- Variance: var[x] = kθ²
- Often used to model the rate parameter of a Poisson or exponential distribution (conjugate to both), or to model the inverse variance (precision) of a Gaussian (conjugate to the Gaussian if the mean is known)
- Note: There is another equivalent parameterization of the gamma in terms of shape and rate parameters (rate = 1/scale). Another related distribution: Inverse gamma.

Dirichlet Distribution
- Used to model non-negative r.v. vectors p = [p_1, ..., p_K] that sum to 1:
  0 ≤ p_k ≤ 1 for k = 1, ..., K, and ∑_{k=1}^K p_k = 1
- Equivalent to a distribution over the (K − 1)-dimensional simplex
- Defined by a K-size vector α = [α_1, ..., α_K] of positive reals
- Distribution defined as:
  Dirichlet(p; α) = (Γ(∑_{k=1}^K α_k) / ∏_{k=1}^K Γ(α_k)) ∏_{k=1}^K p_k^{α_k−1}
- Often used to model the probability vector parameters of the Multinoulli/Multinomial distribution
- The Dirichlet is conjugate to the Multinoulli/Multinomial
- Note: The Dirichlet can be seen as a generalization of the Beta distribution. Normalizing a bunch of Gamma r.v.'s gives an r.v. that is Dirichlet distributed.
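A sketch of the Gamma-normalization construction mentioned above: draw independent Gamma(α_k, 1) variates and normalize them, giving Dirichlet(α) samples whose empirical mean of p_k approaches α_k / α_0 (the α values are arbitrary):

```python
import random

random.seed(0)
alpha = [2.0, 3.0, 5.0]
alpha0 = sum(alpha)

n = 50_000
sums = [0.0] * len(alpha)
for _ in range(n):
    # Independent Gamma(alpha_k, scale=1) draws, normalized to the simplex
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    p = [gi / s for gi in g]
    for k, pk in enumerate(p):
        sums[k] += pk

# E[p_k] = alpha_k / alpha_0
for k, a in enumerate(alpha):
    assert abs(sums[k] / n - a / alpha0) < 0.01
```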

Dirichlet Distribution (contd.)
- For p = [p_1, p_2, ..., p_K] drawn from Dirichlet(α_1, α_2, ..., α_K):
  - Mean: E[p_k] = α_k / α_0
  - Variance: var[p_k] = α_k(α_0 − α_k) / (α_0² (α_0 + 1)), where α_0 = ∑_{k=1}^K α_k
- Note: p is a point on the (K − 1)-simplex
- Note: α_0 = ∑_{k=1}^K α_k controls how peaked the distribution is
- Note: the α_k's control where the peak(s) occur
- Plot of a 3-dim. Dirichlet (2-dim. simplex) for various values of α: [figure]
Picture courtesy: Computer Vision: Models, Learning, and Inference (Simon Prince)

Now comes the Gaussian (Normal) distribution.

Univariate Gaussian Distribution
- Distribution over a real-valued scalar r.v. x
- Defined by a scalar mean μ and a scalar variance σ²
- Distribution defined as:
  N(x; μ, σ²) = (1/√(2πσ²)) e^{−(x−μ)² / (2σ²)}
- Mean: E[x] = μ
- Variance: var[x] = σ²
- Precision (inverse variance): β = 1/σ²

Multivariate Gaussian Distribution
- Distribution over a multivariate r.v. vector x ∈ R^D of real numbers
- Defined by a mean vector μ ∈ R^D and a D × D covariance matrix Σ:
  N(x; μ, Σ) = (1/√((2π)^D |Σ|)) e^{−(1/2)(x−μ)^T Σ^{−1} (x−μ)}
- The covariance matrix Σ must be symmetric and positive definite:
  - All eigenvalues are positive
  - z^T Σz > 0 for any nonzero real vector z
- Often we parameterize a multivariate Gaussian using the inverse of the covariance matrix, i.e., the precision matrix Λ = Σ^{−1}
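A sketch evaluating the multivariate Gaussian density straight from the formula, plus the eigenvalue check for positive definiteness (μ and Σ are arbitrary illustrative values):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    # (2 pi)^{-D/2} |Sigma|^{-1/2} exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))
    D = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

mu = np.zeros(2)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
# Symmetric positive definite <=> all eigenvalues strictly positive
assert np.all(np.linalg.eigvalsh(Sigma) > 0)

# At the mean, the density equals 1 / sqrt((2 pi)^D |Sigma|)
val = mvn_pdf(mu, mu, Sigma)
assert abs(val - 1 / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(Sigma))) < 1e-12
```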

Multivariate Gaussian: The Covariance Matrix
- The covariance matrix can be spherical, diagonal, or full [figure]

Some nice properties of the Gaussian distribution.

Multivariate Gaussian: Marginals and Conditionals
- Given x having a multivariate Gaussian distribution N(x | μ, Σ) with Λ = Σ^{−1}. Suppose x, μ, and Σ are partitioned as:
  x = [x_a; x_b],  μ = [μ_a; μ_b],  Σ = [Σ_aa Σ_ab; Σ_ba Σ_bb]
- The marginal distribution is simply:
  p(x_a) = N(x_a | μ_a, Σ_aa)
- The conditional distribution is given by:
  p(x_a | x_b) = N(x_a | μ_a + Σ_ab Σ_bb^{−1} (x_b − μ_b), Σ_aa − Σ_ab Σ_bb^{−1} Σ_ba)
- Thus marginals and conditionals of Gaussians are Gaussians
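The marginal property can be checked empirically: dropping the x_b coordinates of joint samples leaves samples whose moments match N(μ_a, Σ_aa). A sketch with arbitrary μ and Σ:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 2.0, -1.0])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.5]])
x = rng.multivariate_normal(mu, Sigma, size=200_000)

# Marginalize by simply discarding the x_b block (here, the last coordinate)
x_a = x[:, :2]
assert np.allclose(x_a.mean(axis=0), mu[:2], atol=0.05)      # mean mu_a
assert np.allclose(np.cov(x_a.T), Sigma[:2, :2], atol=0.05)  # covariance Sigma_aa
```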

Multivariate Gaussian: Marginals and Conditionals (contd.)
- Given the conditional of an r.v. y and the marginal of the r.v. x that y is conditioned on:
  p(x) = N(x | μ, Λ^{−1}),  p(y | x) = N(y | Ax + b, L^{−1})
- The marginal of y and the "reverse" conditional are given by:
  p(y) = N(y | Aμ + b, L^{−1} + AΛ^{−1}A^T)
  p(x | y) = N(x | Σ{A^T L(y − b) + Λμ}, Σ), where Σ = (Λ + A^T LA)^{−1}
- Note that the "reverse conditional" p(x | y) is basically the posterior of x if the prior is p(x)
- Also note that the marginal p(y) is the predictive distribution of y after integrating out x
- A very useful property for probabilistic models with Gaussian likelihoods and/or priors. Also very handy for computing marginal likelihoods.

Gaussians: Product of Gaussians
- Pointwise multiplication of two Gaussians is another (unnormalized) Gaussian [figure]
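For the univariate case this is easy to verify: the product of two Gaussian densities is proportional to a Gaussian whose precision is the sum of the two precisions. A sketch (parameter values arbitrary) checking that the ratio to the combined Gaussian is constant in x:

```python
import math

def npdf(x, mu, var):
    # Univariate Gaussian density N(x; mu, var)
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

mu1, v1, mu2, v2 = 0.0, 1.0, 2.0, 4.0
v = 1 / (1 / v1 + 1 / v2)              # combined variance: precisions add
mu = v * (mu1 / v1 + mu2 / v2)         # precision-weighted combined mean

# If the product is an unnormalized Gaussian N(mu, v), this ratio is a constant
ratios = [npdf(x, mu1, v1) * npdf(x, mu2, v2) / npdf(x, mu, v)
          for x in (-1.0, 0.0, 0.5, 1.0, 3.0)]
assert max(ratios) - min(ratios) < 1e-12
```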

Multivariate Gaussian: Linear Transformations
- Given x ∈ R^d with a multivariate Gaussian distribution N(x; μ, Σ)
- Consider a linear transform of x into y ∈ R^D:
  y = Ax + b, where A is D × d and b ∈ R^D
- Then y ∈ R^D has a multivariate Gaussian distribution:
  N(y; Aμ + b, AΣA^T)

Some Other Important Distributions
- Wishart Distribution and Inverse Wishart (IW) Distribution: used to model D × D p.s.d. matrices
  - The Wishart is often used as a conjugate prior for modeling precision matrices, the IW for covariance matrices
  - For D = 1, the Wishart is the same as the gamma dist., and the IW is the same as the inverse gamma (IG) dist.
- Normal-Wishart Distribution: used to model the mean and precision matrix of a multivar. Gaussian
- Normal-Inverse Wishart (NIW): used to model the mean and cov. matrix of a multivar. Gaussian
  - For D = 1, the corresponding distributions are Normal-Gamma and Normal-Inverse Gamma (NIG)
- Student-t Distribution (a more robust version of the Normal distribution)
  - Can be thought of as a mixture of infinitely many Gaussians with different precisions (or a single Gaussian with its precision/precision matrix given a gamma/Wishart prior and integrated out)
- Please refer to PRML (Bishop) Chapter 2 and Appendix B, or MLAPP (Murphy) Chapter 2 for more details
