
Non-parametric Bayesian Methods
Uncertainty in Artificial Intelligence Tutorial, July 2005

Zoubin Ghahramani
Gatsby Computational Neuroscience Unit, University College London, UK
Center for Automated Learning and Discovery, Carnegie Mellon University
(Starting Jan 2006: Department of Engineering, University of Cambridge, UK)

Bayes Rule Applied to Machine Learning

$$P(\theta|D) = \frac{P(D|\theta)\,P(\theta)}{P(D)}$$

where $P(D|\theta)$ is the likelihood of $\theta$, $P(\theta)$ is the prior probability of $\theta$, and $P(\theta|D)$ is the posterior of $\theta$ given $D$.

Model Comparison:

$$P(m|D) = \frac{P(D|m)\,P(m)}{P(D)}, \qquad P(D|m) = \int P(D|\theta, m)\,P(\theta|m)\,d\theta$$

Prediction:

$$P(x|D,m) = \int P(x|\theta, D, m)\,P(\theta|D, m)\,d\theta$$
$$P(x|D,m) = \int P(x|\theta, m)\,P(\theta|D, m)\,d\theta \quad \text{(if $x$ is iid given $\theta$)}$$

Model Comparison: two examples

[Figure: example data for a mixture of Gaussians and for nonlinear polynomial regression]

e.g. selecting $m$, the number of Gaussians in a mixture model; e.g. selecting $m$, the order of a polynomial in a nonlinear regression model.

$$P(m|D) = \frac{P(D|m)\,P(m)}{P(D)}, \qquad P(D|m) = \int P(D|\theta, m)\,P(\theta|m)\,d\theta$$

A possible procedure:
1. place a prior on $m$, $P(m)$
2. given data, use Bayes rule to infer $P(m|D)$

What is the problem with this procedure?
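The procedure above amounts to weighting each model's marginal likelihood by its prior and normalising. A minimal sketch in Python, where the log marginal likelihoods and the uniform prior are made-up numbers purely for illustration:

```python
import numpy as np

log_evidence = np.array([-120.3, -98.7, -97.9, -99.5])  # assumed log P(D|m) for m = 1..4
log_prior = np.log(np.full(4, 0.25))                     # uniform prior P(m)

log_post = log_evidence + log_prior
log_post -= log_post.max()                               # stabilise the exponentiation
posterior = np.exp(log_post) / np.exp(log_post).sum()    # P(m|D), normalised

for m, p in enumerate(posterior, start=1):
    print(f"P(m={m} | D) = {p:.3f}")
```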

Real data is complicated

Example 1: You are trying to model people's patterns of movie preferences. You believe there are "clusters" of people, so you use a mixture model.

How should you pick $P(m)$, your prior over how many clusters there are? Teenagers, people who like action movies, people who like romantic comedies, people who like horror movies, people who like movies with Marlon Brando, people who like action movies but not science fiction, etc. etc. Even if there are a few well defined clusters, they are unlikely to be Gaussian in the variables you measure. To model complicated distributions you might need many Gaussians for each cluster.

Conclusion: any small finite number seems unreasonable.

Real data is complicated

Example 2: You are trying to model crop yield as a function of rainfall, amount of sunshine, amount of fertilizer, etc. You believe this relationship is nonlinear, so you decide to model it with a polynomial.

How should you pick $P(m)$, your prior over the order of the polynomial? Do you believe the relationship could be linear? quadratic? cubic? What about the interactions between input variables?

Conclusion: any order polynomial seems unreasonable.

How do we adequately capture our beliefs?

Non-parametric Bayesian Models

- Bayesian methods are most powerful when your prior adequately captures your beliefs.
- Inflexible models (e.g. mixture of 5 Gaussians, 4th order polynomial) yield unreasonable inferences.
- Non-parametric models are a way of getting very flexible models.
- Many can be derived by starting with a finite parametric model and taking the limit as the number of parameters goes to infinity.
- Non-parametric models can automatically infer an adequate model size/complexity from the data, without needing to explicitly do Bayesian model comparison. (Even if you believe there are infinitely many possible clusters, you can still infer how many clusters are represented in a finite set of n data points.)

Outline

- Introduction
- Gaussian Processes (GP)
- Dirichlet Processes (DP), different representations:
  - Chinese Restaurant Process (CRP)
  - Urn Model
  - Stick Breaking Representation
  - Infinite limit of mixture models and Dirichlet process mixtures (DPM)
- Hierarchical Dirichlet Processes
- Infinite Hidden Markov Models
- Polya Trees
- Dirichlet Diffusion Trees
- Indian Buffet Processes

Gaussian Processes

A Gaussian process defines a distribution over functions, $f$, where $f$ is a function mapping some input space $\mathcal{X}$ to $\mathbb{R}$:

$$f : \mathcal{X} \to \mathbb{R}$$

Notice that $f$ can be an infinite-dimensional quantity (e.g. if $\mathcal{X} = \mathbb{R}$).

Let's call this distribution $P(f)$.

Let $\mathbf{f} = (f(x_1), f(x_2), \ldots, f(x_n))$ be an $n$-dimensional vector of function values evaluated at $n$ points $x_i \in \mathcal{X}$. Note $\mathbf{f}$ is a random variable.

Definition: $P(f)$ is a Gaussian process if for any finite subset $\{x_1, \ldots, x_n\} \subset \mathcal{X}$, the marginal distribution over that finite subset, $P(\mathbf{f})$, has a multivariate Gaussian distribution.

Gaussian process covariance functions

$P(f)$ is a Gaussian process if for any finite subset $\{x_1, \ldots, x_n\} \subset \mathcal{X}$, the marginal distribution over that finite subset, $P(\mathbf{f})$, has a multivariate Gaussian distribution.

Gaussian processes (GPs) are parameterized by a mean function, $\mu(x)$, and a covariance function, $c(x, x')$:

$$P(f(x), f(x')) = N(\mu, \Sigma), \qquad \mu = \begin{pmatrix} \mu(x) \\ \mu(x') \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} c(x,x) & c(x,x') \\ c(x',x) & c(x',x') \end{pmatrix}$$

and similarly for $P(f(x_1), \ldots, f(x_n))$, where now $\mu$ is an $n \times 1$ vector and $\Sigma$ is an $n \times n$ matrix.

E.g.:

$$c(x_i, x_j) = v_0 \exp\left\{-\left(\frac{|x_i - x_j|}{\lambda}\right)^{\alpha}\right\} + v_1 + v_2\,\delta_{ij}$$

with params $(v_0, v_1, v_2, \lambda, \alpha)$.

Once the mean and covariance functions are defined, everything else about GPs follows from the basic rules of probability applied to multivariate Gaussians.
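Below is a minimal sketch (not from the tutorial) of this example covariance function in Python: it builds the covariance matrix at a finite set of inputs and draws one sample function from the zero-mean GP prior. The particular hyperparameter values are assumptions.

```python
import numpy as np

def cov(xs, v0=1.0, v1=0.1, v2=0.01, lam=1.0, alpha=2.0):
    """c(x_i, x_j) = v0 * exp(-(|x_i - x_j| / lam)**alpha) + v1 + v2 * delta_ij."""
    d = np.abs(xs[:, None] - xs[None, :])          # pairwise distances |x_i - x_j|
    return v0 * np.exp(-(d / lam) ** alpha) + v1 + v2 * np.eye(len(xs))

xs = np.linspace(0, 10, 100)
K = cov(xs)
f = np.random.multivariate_normal(np.zeros(len(xs)), K)   # one draw from the GP prior
```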

Samples from Gaussian processes with different c(x, x')

[Figure: sample functions f(x) drawn from GP priors with different covariance functions]

Using Gaussian processes for nonlinear regression

Imagine observing a data set $D = \{(x_i, y_i)\}_{i=1}^{n} = (\mathbf{x}, \mathbf{y})$.

Model:

$$y_i = f(x_i) + \epsilon_i, \qquad f \sim GP(\cdot|0, c), \qquad \epsilon_i \sim N(\cdot|0, \sigma^2)$$

Prior on $f$ is a GP, likelihood is Gaussian, therefore posterior on $f$ is also a GP.

We can use this to make predictions:

$$P(y'|x', D) = \int df\, P(y'|x', f, D)\,P(f|D)$$

We can also compute the marginal likelihood (evidence) and use this to compare or tune covariance functions:

$$P(\mathbf{y}|\mathbf{x}) = \int df\, P(\mathbf{y}|f, \mathbf{x})\,P(f)$$
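Because everything is Gaussian, the predictive distribution has a closed form obtained by conditioning a joint Gaussian. A minimal sketch under the model above, with a squared-exponential covariance and toy data that are purely assumptions:

```python
import numpy as np

def k(a, b, v0=1.0, lam=1.0):
    """Squared-exponential covariance between 1-D input vectors a and b."""
    return v0 * np.exp(-((a[:, None] - b[None, :]) / lam) ** 2)

x = np.array([-2.0, -1.0, 0.0, 1.5, 2.5])          # toy training inputs
y = np.sin(x) + 0.1 * np.random.randn(len(x))      # toy noisy targets
sigma2 = 0.01                                      # noise variance

xs = np.linspace(-3, 3, 200)                       # test inputs
Kxx = k(x, x) + sigma2 * np.eye(len(x))
Ksx = k(xs, x)
Kss = k(xs, xs)

# Gaussian conditioning gives the GP posterior (predictive) mean and covariance.
mean = Ksx @ np.linalg.solve(Kxx, y)
cov = Kss - Ksx @ np.linalg.solve(Kxx, Ksx.T)
std = np.sqrt(np.maximum(np.diag(cov), 0.0))       # pointwise predictive std dev
```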

Prediction using GPs with different c(x, x')

[Figure: a sample from the prior for each covariance function, and the corresponding predictions, shown as the mean with two standard deviations]

From linear regression to GPs:

- Linear regression with inputs $x_i$ and outputs $y_i$:
  $$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
- Linear regression with $K$ basis functions:
  $$y_i = \sum_{k=1}^{K} \beta_k\,\phi_k(x_i) + \epsilon_i$$
- Bayesian linear regression with basis functions:
  $$\beta_k \sim N(\cdot|0, \lambda_k) \ \text{(independent of $\beta_\ell$, $\forall \ell \neq k$)}, \qquad \epsilon_i \sim N(\cdot|0, \sigma^2)$$
- Integrating out the coefficients, $\beta_k$, we find:
  $$E[y_i] = 0, \qquad \mathrm{Cov}(y_i, y_j) \stackrel{\mathrm{def}}{=} C_{ij} = \sum_k \lambda_k\,\phi_k(x_i)\,\phi_k(x_j) + \delta_{ij}\,\sigma^2$$

This is a Gaussian process with covariance function $c(x_i, x_j) = C_{ij}$. This Gaussian process has a finite number ($K$) of basis functions. Many useful GP covariance functions correspond to infinitely many basis functions.
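A minimal sketch of the identity above: picking some basis functions and weight variances (the Gaussian bumps and the values below are assumptions), the induced covariance $C = \Phi\,\mathrm{diag}(\lambda)\,\Phi^\top + \sigma^2 I$ defines an equivalent GP prior over the outputs.

```python
import numpy as np

def phi(x, centers, width=1.0):
    """Gaussian bump basis functions evaluated at inputs x; shape (len(x), K)."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

centers = np.linspace(-3, 3, 10)          # K = 10 basis functions
lam = np.full(len(centers), 0.5)          # prior variances lambda_k of the weights
sigma2 = 0.1                              # noise variance

x = np.linspace(-3, 3, 50)
Phi = phi(x, centers)
C = Phi @ np.diag(lam) @ Phi.T + sigma2 * np.eye(len(x))   # induced GP covariance

y = np.random.multivariate_normal(np.zeros(len(x)), C)     # one draw from the equivalent GP
```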

Using Gaussian Processes for Classification

Binary classification problem: given a data set $D = \{(x_i, y_i)\}_{i=1}^{n}$ with binary class labels $y_i \in \{-1, +1\}$, infer class label probabilities at new points.

[Figure: a latent function f over the inputs and the corresponding class probabilities]

There are many ways to relate function values $f(x_i)$ to class probabilities:

$$p(y|f) = \begin{cases} \dfrac{1}{1 + \exp(-yf)} & \text{sigmoid (logistic)} \\ \Phi(yf) & \text{cumulative normal (probit)} \\ H(yf) & \text{threshold} \\ \epsilon + (1 - 2\epsilon)\,H(yf) & \text{robust threshold} \end{cases}$$

Outline

- Introduction
- Gaussian Processes (GP)
- Dirichlet Processes (DP), different representations:
  - Chinese Restaurant Process (CRP)
  - Urn Model
  - Stick Breaking Representation
  - Infinite limit of mixture models and Dirichlet process mixtures (DPM)
- Hierarchical Dirichlet Processes
- Infinite Hidden Markov Models
- Polya Trees
- Dirichlet Diffusion Trees
- Indian Buffet Processes

Dirichlet Distribution

The Dirichlet distribution is a distribution over the $K$-dimensional probability simplex. Let $\mathbf{p}$ be a $K$-dimensional vector s.t. $\forall j: p_j \geq 0$ and $\sum_{j=1}^{K} p_j = 1$.

$$P(\mathbf{p}|\boldsymbol{\alpha}) = \mathrm{Dir}(\alpha_1, \ldots, \alpha_K) \stackrel{\mathrm{def}}{=} \frac{\Gamma(\sum_j \alpha_j)}{\prod_j \Gamma(\alpha_j)} \prod_{j=1}^{K} p_j^{\alpha_j - 1}$$

where the first term is a normalization constant and $E(p_j) = \alpha_j / (\sum_k \alpha_k)$.

The Dirichlet is conjugate to the multinomial distribution. Let

$$c\,|\,\mathbf{p} \sim \mathrm{Multinomial}(\cdot|\mathbf{p})$$

That is, $P(c = j|\mathbf{p}) = p_j$. Then the posterior is also Dirichlet:

$$P(\mathbf{p}|c = j, \boldsymbol{\alpha}) = \frac{P(c = j|\mathbf{p})\,P(\mathbf{p}|\boldsymbol{\alpha})}{P(c = j|\boldsymbol{\alpha})} = \mathrm{Dir}(\boldsymbol{\alpha}')$$

where $\alpha_j' = \alpha_j + 1$ and $\forall \ell \neq j: \alpha_\ell' = \alpha_\ell$.

(Here $\Gamma(x) = (x-1)\Gamma(x-1) = \int_0^{\infty} t^{x-1} e^{-t}\,dt$; for integer $n$, $\Gamma(n) = (n-1)!$.)
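A minimal sketch of this conjugacy in Python: observing multinomial counts simply adds to the Dirichlet parameters. The prior and the counts below are made-up numbers for illustration.

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])     # prior Dir(alpha) over p = (p1, p2, p3)
counts = np.array([5, 0, 2])          # observed multinomial counts for each outcome

alpha_post = alpha + counts           # posterior is Dir(alpha + counts)
print("posterior mean of p:", alpha_post / alpha_post.sum())

# Monte Carlo check: posterior samples average to the same mean.
samples = np.random.dirichlet(alpha_post, size=10000)
print("sample mean of p:   ", samples.mean(axis=0))
```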

Dirichlet Distributions

Examples of Dirichlet distributions over $\mathbf{p} = (p_1, p_2, p_3)$, which can be plotted in 2D since $p_3 = 1 - p_1 - p_2$:

[Figure: density plots of several Dirichlet distributions on the 2-simplex]

Dirichlet Processes

- Gaussian processes define a distribution over functions,
  $$f \sim GP(\cdot|\mu, c)$$
  where $\mu$ is the mean function and $c$ is the covariance function. We can think of GPs as "infinite-dimensional" Gaussians.
- Dirichlet processes define a distribution over distributions (a measure on measures),
  $$G \sim DP(\cdot|G_0, \alpha)$$
  where $\alpha > 0$ is a scaling parameter, and $G_0$ is the base measure. We can think of DPs as "infinite-dimensional" Dirichlet distributions.

Note that both $f$ and $G$ are infinite-dimensional objects.

Dirichlet Process

Let $\Theta$ be a measurable space, $G_0$ be a probability measure on $\Theta$, and $\alpha$ a positive real number.

For all finite partitions $(A_1, \ldots, A_K)$ of $\Theta$,

$$G \sim DP(\cdot|G_0, \alpha)$$

means that

$$(G(A_1), \ldots, G(A_K)) \sim \mathrm{Dir}(\alpha G_0(A_1), \ldots, \alpha G_0(A_K))$$

(Ferguson, 1973)

Dirichlet Process

$$G \sim DP(\cdot|G_0, \alpha)$$

OK, but what does it look like? Samples from a DP are discrete with probability one:

$$G(\theta) = \sum_{k=1}^{\infty} \pi_k\,\delta_{\theta_k}(\theta)$$

where $\delta_{\theta_k}(\cdot)$ is a Dirac delta at $\theta_k$, and $\theta_k \sim G_0(\cdot)$.

Note: $E(G) = G_0$. As $\alpha \to \infty$, $G$ looks more like $G_0$.

Dirichlet Process: Conjugacy

If the prior on $G$ is a DP, $P(G) = DP(G|G_0, \alpha)$, and you observe $\theta$ with $P(\theta|G) = G(\theta)$, then the posterior is also a DP:

$$P(G|\theta) = DP\!\left(\frac{\alpha}{\alpha+1} G_0 + \frac{1}{\alpha+1}\delta_\theta,\ \alpha + 1\right)$$

Generalization for $n$ observations:

$$P(G|\theta_1, \ldots, \theta_n) = DP\!\left(\frac{\alpha}{\alpha+n} G_0 + \frac{1}{\alpha+n}\sum_{i=1}^{n}\delta_{\theta_i},\ \alpha + n\right)$$

Analogous to Dirichlet being conjugate to multinomial observations.

Dirichlet Process

Blackwell and MacQueen's (1973) urn representation: if

$$G \sim DP(\cdot|G_0, \alpha) \quad \text{and} \quad \theta|G \sim G(\cdot)$$

then

$$\theta_n\,|\,\theta_1, \ldots, \theta_{n-1}, G_0, \alpha \;\sim\; \frac{\alpha}{n-1+\alpha}\,G_0(\cdot) + \frac{1}{n-1+\alpha}\sum_{j=1}^{n-1}\delta_{\theta_j}(\cdot)$$

where

$$P(\theta_n|\theta_1, \ldots, \theta_{n-1}, G_0, \alpha) \propto \int dG \prod_{j=1}^{n} P(\theta_j|G)\,P(G|G_0, \alpha)$$

The model exhibits a "clustering effect".

Chinese Restaurant Process (CRP)

This shows the clustering effect explicitly.

The restaurant has infinitely many tables $k = 1, 2, \ldots$. Customers are indexed by $i = 1, 2, \ldots$, with values $\phi_i$. Tables have values $\theta_k$ drawn from $G_0$.

- $K$ = total number of occupied tables so far
- $n$ = total number of customers so far
- $n_k$ = number of customers seated at table $k$

Generating from a CRP:

customer 1 enters the restaurant and sits at table 1;
  $\phi_1 = \theta_1$ where $\theta_1 \sim G_0$; $K = 1$, $n = 1$, $n_1 = 1$
for $n = 2, \ldots$:
  customer $n$ sits at table $k$ with prob $\frac{n_k}{n-1+\alpha}$ for $k = 1, \ldots, K$,
    or at table $K+1$ with prob $\frac{\alpha}{n-1+\alpha}$ (new table)
  if a new table was chosen then $K \leftarrow K+1$, $\theta_K \sim G_0$ endif
  set $\phi_n$ to the $\theta_k$ of the table $k$ that customer $n$ sat at; set $n_k \leftarrow n_k + 1$
endfor

Clustering effect: new students entering a school join clubs in proportion to how popular those clubs already are ($\propto n_k$). With some probability (proportional to $\alpha$), a new student starts a new club.

(Aldous, 1985)

Chinese Restaurant Process

[Figure: customers $\phi_1, \phi_3, \phi_5$ seated at table $\theta_1$; $\phi_2$ at $\theta_2$; $\phi_4$ at $\theta_3$; $\phi_6$ at $\theta_4$; ...]

Generating from a CRP: as in the algorithm on the previous slide (see also the sketch below).

The resulting conditional distribution over $\phi_n$:

$$\phi_n\,|\,\phi_1, \ldots, \phi_{n-1}, G_0, \alpha \;\sim\; \frac{\alpha}{n-1+\alpha}\,G_0(\cdot) + \sum_{k=1}^{K}\frac{n_k}{n-1+\alpha}\,\delta_{\theta_k}(\cdot)$$
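A minimal sketch of the CRP generative process in Python. Taking the base measure $G_0$ to be a standard normal over table values is an assumption made only so the code is concrete.

```python
import numpy as np

def crp_sample(n, alpha, rng=np.random.default_rng()):
    """Seat n customers; return table assignments and each customer's value phi_i."""
    counts = []          # n_k: number of customers at each occupied table
    thetas = []          # theta_k: value of each occupied table, drawn from G0
    assignments = []
    for i in range(n):   # customer i+1 arrives; i customers are already seated
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):                 # new table chosen
            counts.append(0)
            thetas.append(rng.normal())      # theta_K ~ G0 (standard normal here)
        counts[k] += 1
        assignments.append(k)
    phis = [thetas[k] for k in assignments]
    return assignments, phis

assignments, phis = crp_sample(n=20, alpha=1.0)
print(assignments)   # clustering effect: a few tables collect most customers
```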

Relationship between CRPs and DPs

- DP is a distribution over distributions.
- DP results in discrete distributions, so if you draw $n$ points you are likely to get repeated values.
- A DP induces a partitioning of the $n$ points, e.g. $(1\ 3\ 4)(2\ 5) \Leftrightarrow \phi_1 = \phi_3 = \phi_4 \neq \phi_2 = \phi_5$.
- CRP is the corresponding distribution over partitions.

Dirichlet Processes: Stick Breaking Representation

Samples $G \sim DP(\cdot|G_0, \alpha)$ from a DP can be represented as follows:

$$G(\cdot) = \sum_{k=1}^{\infty} \pi_k\,\delta_{\theta_k}(\cdot)$$

where $\theta_k \sim G_0(\cdot)$, $\sum_{k=1}^{\infty}\pi_k = 1$,

$$\pi_k = \beta_k \prod_{j=1}^{k-1}(1 - \beta_j) \quad \text{and} \quad \beta_k \sim \mathrm{Beta}(\cdot|1, \alpha)$$

[Figure: densities $p(\beta)$ of Beta(1,1), Beta(1,10), and Beta(1,0.5)]

(Sethuraman, 1994)
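A minimal sketch of the stick-breaking construction, truncated at a finite number of sticks; the truncation level and the choice $G_0 = N(0,1)$ are assumptions for illustration.

```python
import numpy as np

def stick_breaking_dp(alpha, n_sticks=1000, rng=np.random.default_rng()):
    betas = rng.beta(1.0, alpha, size=n_sticks)                   # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    pis = betas * remaining                                       # pi_k = beta_k * prod_{j<k}(1 - beta_j)
    thetas = rng.normal(size=n_sticks)                            # theta_k ~ G0
    return pis, thetas

pis, thetas = stick_breaking_dp(alpha=2.0)
print("mass covered by the truncation:", pis.sum())               # close to 1 for many sticks
```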

Other Stick Breaking Processes

- Dirichlet Process (Sethuraman, 1994): $\beta_k \sim \mathrm{Beta}(1, \alpha)$
- Beta Two-parameter Process (Ishwaran and Zarepour, 2000): $\beta_k \sim \mathrm{Beta}(a, b)$
- Pitman-Yor Process (aka two-parameter Poisson-Dirichlet Process; Pitman and Yor, 1997): $\beta_k \sim \mathrm{Beta}(1 - a, b + ka)$

Note: the mean of a Beta($a$, $b$) is $a/(a+b)$.

Dirichlet Processes: Big Picture

There are many ways to derive the Dirichlet Process:

- Dirichlet distribution
- Urn model
- Chinese restaurant process
- Stick breaking
- Gamma process (I didn't talk about this one)

Dirichlet Process Mixtures

DPs are discrete with probability one, so they are not suitable for use as a prior on continuous densities.

In a Dirichlet Process Mixture, we draw the parameters of a mixture model from a draw from a DP:

$$G \sim DP(\cdot|G_0, \alpha), \qquad \theta_i \sim G(\cdot), \qquad x_i \sim p(\cdot|\theta_i)$$

[Figure: graphical model with $G_0$ and $\alpha$ feeding into $G$, which generates $\theta_i$, which generates $x_i$, in a plate over $i = 1, \ldots, n$]

For example, if $p(\cdot|\theta)$ is a Gaussian density with parameters $\theta$, then we have a Dirichlet Process Mixture of Gaussians. Of course, $p(\cdot|\theta)$ could be any density.

We can derive DPMs from finite mixture models (Neal).
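A minimal sketch of generating data from a Dirichlet Process Mixture of Gaussians, using a truncated stick-breaking draw of $G$; the base measure over means and the unit observation variance are assumptions.

```python
import numpy as np

def sample_dpm_gaussian(n, alpha, n_sticks=500, rng=np.random.default_rng()):
    betas = rng.beta(1.0, alpha, size=n_sticks)
    pis = betas * np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    pis /= pis.sum()                                # renormalise the truncated weights
    means = rng.normal(0.0, 3.0, size=n_sticks)     # theta_k ~ G0 = N(0, 3^2)
    z = rng.choice(n_sticks, size=n, p=pis)         # component of each data point
    x = rng.normal(means[z], 1.0)                   # x_i ~ N(theta_{z_i}, 1)
    return x, z

x, z = sample_dpm_gaussian(n=300, alpha=1.0)
print("distinct clusters among 300 points:", len(np.unique(z)))
```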

Samples from a Dirichlet Process Mixture of Gaussians

[Figure: samples with N = 10, N = 20, N = 100, and N = 300 points]

Notice that more structure (clusters) appears as you draw more points.

(figure inspired by Neal)

Dirichlet Process Mixtures (Infinite Mixtures)

Consider using a finite mixture of $K$ components to model a data set $D = \{x^{(1)}, \ldots, x^{(n)}\}$:

$$p(x^{(i)}|\boldsymbol{\theta}) = \sum_{j=1}^{K} \pi_j\,p_j(x^{(i)}|\theta_j) = \sum_{j=1}^{K} P(s^{(i)} = j|\boldsymbol{\pi})\,p_j(x^{(i)}|\theta_j, s^{(i)} = j)$$

The distribution of indicators $\mathbf{s} = (s^{(1)}, \ldots, s^{(n)})$ given $\boldsymbol{\pi}$ is multinomial:

$$P(s^{(1)}, \ldots, s^{(n)}|\boldsymbol{\pi}) = \prod_{j=1}^{K} \pi_j^{n_j}, \qquad n_j \stackrel{\mathrm{def}}{=} \sum_{i=1}^{n}\delta(s^{(i)}, j)$$

Assume the mixing proportions $\boldsymbol{\pi}$ have a given symmetric conjugate Dirichlet prior:

$$p(\boldsymbol{\pi}|\alpha) = \frac{\Gamma(\alpha)}{\Gamma(\alpha/K)^K} \prod_{j=1}^{K} \pi_j^{\alpha/K - 1}$$

Dirichlet Process Mixtures (Infinite Mixtures) - II

The distribution of indicators $\mathbf{s} = (s^{(1)}, \ldots, s^{(n)})$ given $\boldsymbol{\pi}$ is multinomial:

$$P(s^{(1)}, \ldots, s^{(n)}|\boldsymbol{\pi}) = \prod_{j=1}^{K} \pi_j^{n_j}, \qquad n_j \stackrel{\mathrm{def}}{=} \sum_{i=1}^{n}\delta(s^{(i)}, j)$$

The mixing proportions $\boldsymbol{\pi}$ have a symmetric conjugate Dirichlet prior:

$$p(\boldsymbol{\pi}|\alpha) = \frac{\Gamma(\alpha)}{\Gamma(\alpha/K)^K} \prod_{j=1}^{K} \pi_j^{\alpha/K - 1}$$

Integrating out the mixing proportions, $\boldsymbol{\pi}$, we obtain

$$P(s^{(1)}, \ldots, s^{(n)}|\alpha) = \int d\boldsymbol{\pi}\,P(\mathbf{s}|\boldsymbol{\pi})\,P(\boldsymbol{\pi}|\alpha) = \frac{\Gamma(\alpha)}{\Gamma(n+\alpha)} \prod_{j=1}^{K} \frac{\Gamma(n_j + \alpha/K)}{\Gamma(\alpha/K)}$$

Dirichlet Process Mixtures (Infinite Mixtures) - III

Starting from

$$P(\mathbf{s}|\alpha) = \frac{\Gamma(\alpha)}{\Gamma(n+\alpha)} \prod_{j=1}^{K} \frac{\Gamma(n_j + \alpha/K)}{\Gamma(\alpha/K)}$$

Conditional probabilities: finite $K$

$$P(s^{(i)} = j|\mathbf{s}_{-i}, \alpha) = \frac{n_{-i,j} + \alpha/K}{n - 1 + \alpha}$$

where $\mathbf{s}_{-i}$ denotes all indices except $i$, and $n_{-i,j} \stackrel{\mathrm{def}}{=} \sum_{\ell \neq i}\delta(s^{(\ell)}, j)$.

DP: more populous classes are more likely to be joined.

Conditional probabilities: infinite $K$

Taking the limit as $K \to \infty$ yields the conditionals

$$P(s^{(i)} = j|\mathbf{s}_{-i}, \alpha) = \begin{cases} \dfrac{n_{-i,j}}{n-1+\alpha} & j \text{ represented} \\[2mm] \dfrac{\alpha}{n-1+\alpha} & \text{all } j \text{ not represented} \end{cases}$$

The left-over mass, $\alpha$, is spread over a countably infinite number of indicator settings.

Gibbs sampling from the posterior of the indicators is often easy! (See the sketch below.)
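A minimal sketch (not the tutorial's code) of one collapsed Gibbs sweep over the indicators in a DP mixture of 1-D Gaussians, combining the conditionals above with the conjugate predictive likelihood. The unit observation variance, the $N(0, \tau^2)$ prior on cluster means, and the toy data are all assumptions.

```python
import numpy as np
from collections import defaultdict

def gibbs_sweep(x, z, alpha=1.0, sigma2=1.0, tau2=9.0, rng=np.random.default_rng()):
    """One sweep of collapsed Gibbs sampling over the cluster indicators z."""
    for i in range(len(x)):
        # sufficient statistics of each cluster, excluding point i
        counts, sums = defaultdict(int), defaultdict(float)
        for lab, xv in zip(np.delete(z, i), np.delete(x, i)):
            counts[lab] += 1
            sums[lab] += xv
        labels = sorted(counts)
        logp = []
        for lab in labels:
            prec = counts[lab] / sigma2 + 1.0 / tau2   # posterior precision of the cluster mean
            mu = (sums[lab] / sigma2) / prec           # posterior mean of the cluster mean
            var = sigma2 + 1.0 / prec                  # predictive variance for x[i]
            # n_{-i,j} times the predictive density (shared denominator and constants dropped)
            logp.append(np.log(counts[lab]) - 0.5 * np.log(var) - 0.5 * (x[i] - mu) ** 2 / var)
        var0 = sigma2 + tau2                           # predictive variance under a new cluster
        logp.append(np.log(alpha) - 0.5 * np.log(var0) - 0.5 * x[i] ** 2 / var0)
        logp = np.array(logp)
        p = np.exp(logp - logp.max())
        p /= p.sum()
        choice = rng.choice(len(p), p=p)
        z[i] = labels[choice] if choice < len(labels) else max(labels, default=-1) + 1
    return z

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4, 1, 50), rng.normal(4, 1, 50)])   # toy data, two clusters
z = np.zeros(len(x), dtype=int)
for _ in range(20):
    z = gibbs_sweep(x, z, rng=rng)
print("clusters found:", len(np.unique(z)))
```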

Approximate Inference in DPMs

- Gibbs sampling (e.g. Escobar and West, 1995; Neal, 2000; Rasmussen, 2000)
- Variational approximation (Blei and Jordan, 2005)
- Expectation propagation (Minka and Ghahramani, 2003)
- Hierarchical clustering (Heller and Ghahramani, 2005)

Hierarchical Dirichlet Processes (HDP)

Assume you have data which are divided into $J$ groups. You assume there are clusters within each group, but you also believe these clusters are shared between groups (i.e. data points in different groups can belong to the same cluster).

In an HDP there is a common DP,

$$G_0\,|\,H, \gamma \sim DP(\cdot|H, \gamma)$$

which forms the base measure for a draw from a DP within each group:

$$G_j\,|\,G_0, \alpha \sim DP(\cdot|G_0, \alpha)$$

[Figure: graphical model with $H, \gamma \to G_0 \to G_j \to \phi_{ji} \to x_{ji}$, plates over $i = 1, \ldots, n_j$ and $j = 1, \ldots, J$]

Infinite Hidden Markov Models

[Figure: HMM graphical model with states $S_1, S_2, S_3, \ldots, S_T$ and observations $Y_1, Y_2, Y_3, \ldots, Y_T$; the $K \times K$ transition matrix]

In an HMM with $K$ states, the transition matrix has $K \times K$ elements. We want to let $K \to \infty$.

$$\beta\,|\,\gamma \sim \mathrm{Stick}(\cdot|\gamma) \quad \text{(base distribution over states)}$$
$$\pi_k\,|\,\alpha, \beta \sim DP(\cdot|\alpha, \beta) \quad \text{(transition parameters for state } k = 1, \ldots)$$
$$\theta_k\,|\,H \sim H(\cdot) \quad \text{(emission parameters for state } k = 1, \ldots)$$
$$s_t\,|\,s_{t-1}, (\pi_k)_{k=1}^{\infty} \sim \pi_{s_{t-1}}(\cdot) \quad \text{(transition)}$$
$$y_t\,|\,s_t, (\theta_k)_{k=1}^{\infty} \sim p(\cdot|\theta_{s_t}) \quad \text{(emission)}$$

Can be derived from the HDP framework.

(Beal, Ghahramani, and Rasmussen, 2002; Teh et al., 2004)

Infinite HMM: Trajectories under the prior

(modified to treat self-transitions specially)

[Figure: sample state trajectories that are explorative (a = 0.1, b = 1000, c = 100), repetitive (a = 0, b = 0.1, c = 100), self-transitioning (a = 2, b = 2, c = 20), and ramping (a = 1, b = 1, c = 10000)]

Just 3 hyperparameters provide:
- slow/fast dynamics
- sparse/dense transition matrices
- many/few states
- left-to-right structure, with multiple interacting cycles

Polya Trees

Let $\Theta$ be some measurable space. Assume you have a set $\Pi$ of nested partitions of the space:

$$\Theta = B_0 \cup B_1, \quad B_0 \cap B_1 = \emptyset$$
$$B_0 = B_{00} \cup B_{01}, \quad B_{00} \cap B_{01} = \emptyset$$
$$B_1 = B_{10} \cup B_{11}, \quad B_{10} \cap B_{11} = \emptyset$$
etc.

Let $e = (e_1, \ldots, e_m)$ be a binary string, $e_i \in \{0, 1\}$. Let $A = \{\alpha_e > 0 : e \text{ is a binary string}\}$ and $\Pi = \{B_e \subset \Theta : e \text{ is a binary string}\}$.

Draw

$$Y_e \sim \mathrm{Beta}(\cdot|\alpha_{e0}, \alpha_{e1})$$

Then $G \sim PT(\Pi, A)$ if

$$G(B_{e_1 \cdots e_m}) = \prod_{j=1:\,e_j=0}^{m} Y_{e_1 \cdots e_{j-1}} \prod_{j=1:\,e_j=1}^{m} (1 - Y_{e_1 \cdots e_{j-1}})$$

Actually this is really easy to understand.

Polya Trees

You are given a binary tree dividing up $\Theta$, and positive $\alpha$'s on each branch of the tree. You can draw from a Polya tree distribution by drawing Beta random variables dividing up the mass at each branch point.

Properties:
- Polya Trees generalize DPs: a PT is a DP if $\alpha_e = \alpha_{e0} + \alpha_{e1}$ for all $e$.
- Conjugacy: if $G \sim PT(\Pi, A)$ and $\theta|G \sim G$, then $G|\theta \sim PT(\Pi, A')$.
- Disadvantages: posterior discontinuities, fixed partitions.

(Ferguson, 1974; Lavine, 1992)
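A minimal sketch of drawing from a Polya tree prior on $[0, 1)$ with dyadic partitions to a fixed depth. Letting $\alpha_e$ depend only on the depth (here $\alpha_e = d^2$ at depth $d$) is a common choice but an assumption here, as is the truncation depth.

```python
import numpy as np

def polya_tree_draw(depth=8, rng=np.random.default_rng()):
    """Return the mass G(B_e) of each length-`depth` dyadic interval of [0, 1)."""
    masses = np.array([1.0])                            # the single interval at depth 0
    for d in range(1, depth + 1):
        alpha = float(d ** 2)                           # alpha_{e0} = alpha_{e1} at this depth
        Y = rng.beta(alpha, alpha, size=len(masses))    # Y_e ~ Beta(alpha_{e0}, alpha_{e1})
        left, right = masses * Y, masses * (1 - Y)      # split each interval's mass
        masses = np.column_stack([left, right]).ravel()
    return masses

masses = polya_tree_draw()
print(len(masses), masses.sum())    # 2**depth intervals whose masses sum to 1
```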

Dirichlet Diffusion Trees (DFT)

(Neal, 2001)

In a DPM, the parameters of one mixture component are independent of the other components; this lack of structure is potentially undesirable. A DFT is a generalization of DPMs with hierarchical structure between components.

To generate from a DFT, we will consider $\theta$ taking a random walk according to a Brownian motion Gaussian diffusion process.

- $\theta_1(t)$ is a Gaussian diffusion process starting at the origin ($\theta_1(0) = 0$), run for unit time.
- $\theta_2(t)$ also starts at the origin and follows $\theta_1$, but diverges at some time $\tau_d$, at which point the path followed by $\theta_2$ becomes independent of $\theta_1$'s path.
- $a(t)$ is a divergence or hazard function, e.g. $a(t) = 1/(1-t)$. For small $dt$:
  $$P(\theta \text{ diverges in } (t, t + dt)) = \frac{a(t)\,dt}{m}$$
  where $m$ is the number of previous points that have followed this path.
- If $\theta_i$ reaches a branch point between two paths, it picks a branch in proportion to the number of points that have followed that path.

Dirichlet Diffusion Trees (DFT)

Generating from a DFT:

[Figures from Neal, 2001]

Dirichlet Diffusion Trees (DFT)

Some samples from DFT priors:

[Figures from Neal, 2001]

Indian Buffet Processes (IBP)

(Griffiths and Ghahramani, 2005)

Priors on Binary Matrices

[Figure: a binary matrix whose rows are data points and whose columns are clusters]

- Rows are data points.
- Columns are clusters.
- We can think of CRPs as priors on infinite binary matrices: since each data point is assigned to one and only one cluster (class), the rows sum to one.

More General Priors on Binary Matrices

[Figure: a binary matrix whose rows are data points and whose columns are features]

- Rows are data points.
- Columns are features.
- We can think of IBPs as priors on infinite binary matrices: each data point can now have multiple features, so the rows can sum to more than one.

Why?

- Many unsupervised learning algorithms can be thought of as modelling data in terms of hidden variables.
- Clustering algorithms represent data in terms of which cluster each data point belongs to.
- But clustering models are restrictive; they do not have distributed representations.
- Consider describing a person as "male", "married", "Democrat", "Red Sox fan"; these features may be unobserved (latent).
- The number of potential latent features for describing a person (or news story, gene, image, speech waveform, etc.) is unlimited.

From finite to infinite binary matrices

$$z_{ik} \sim \mathrm{Bernoulli}(\theta_k), \qquad \theta_k \sim \mathrm{Beta}(\alpha/K, 1)$$

- Note that $P(z_{ik} = 1|\alpha) = E(\theta_k) = \frac{\alpha/K}{\alpha/K + 1}$, so as $K$ grows larger the matrix gets sparser.
- So if $Z$ is $N \times K$, the expected number of nonzero entries is $N\alpha/(1 + \alpha/K) < N\alpha$.
- Even in the $K \to \infty$ limit, the matrix is expected to have a finite number of non-zero entries.
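A minimal sketch checking this sparsity claim empirically: under the finite beta-Bernoulli model, the number of ones stays near $N\alpha/(1 + \alpha/K)$, and hence below $N\alpha$, as $K$ grows. The values of $N$ and $\alpha$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha = 100, 5.0
for K in [10, 100, 1000, 10000]:
    theta = rng.beta(alpha / K, 1.0, size=K)     # theta_k ~ Beta(alpha/K, 1)
    Z = rng.random((N, K)) < theta               # z_ik ~ Bernoulli(theta_k)
    print(f"K={K:6d}  ones={Z.sum():5d}  expected={N * alpha / (1 + alpha / K):7.1f}")
```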

From finite to infinite binary matrices

Just as with CRPs we can integrate out $\boldsymbol{\theta}$, leaving:

$$P(Z|\alpha) = \int P(Z|\boldsymbol{\theta})\,P(\boldsymbol{\theta}|\alpha)\,d\boldsymbol{\theta} = \prod_{k} \frac{\frac{\alpha}{K}\,\Gamma(m_k + \frac{\alpha}{K})\,\Gamma(N - m_k + 1)}{\Gamma(N + 1 + \frac{\alpha}{K})}$$

The conditional assignments are:

$$P(z_{ik} = 1|\mathbf{z}_{-i,k}) = \int_0^1 P(z_{ik}|\theta_k)\,p(\theta_k|\mathbf{z}_{-i,k})\,d\theta_k = \frac{m_{-i,k} + \frac{\alpha}{K}}{N + \frac{\alpha}{K}}$$

where $\mathbf{z}_{-i,k}$ is the set of assignments of all objects, not including $i$, for feature $k$, and $m_{-i,k}$ is the number of objects having feature $k$, not including $i$.

From finite to infinite binary matrices

A technical difficulty: the probability for any particular matrix goes to zero as $K \to \infty$:

$$\lim_{K \to \infty} P(Z|\alpha) = 0$$

However, if we consider equivalence classes of matrices in left-ordered form, obtained by reordering the columns, $[Z] = \mathrm{lof}(Z)$, we get:

$$\lim_{K \to \infty} P([Z]|\alpha) = \exp\{-\alpha H_N\} \cdot \frac{\alpha^{K_+}}{\prod_{h>0} K_h!} \cdot \prod_{k \leq K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!}$$

- $K_+$ is the number of features assigned (i.e. with non-zero column sum).
- $H_N = \sum_{i=1}^{N} \frac{1}{i}$ is the $N$th harmonic number.
- $K_h$ are the number of features with history $h$ (a technicality).
- This distribution is exchangeable, i.e. it is not affected by the ordering on objects. This is important for its use as a prior in settings where the objects have no natural ordering.

Binary matrices in left-ordered form

[Figure: (a) a class matrix and its left-ordered transform; (b) a left-ordered feature matrix]

(a) The class matrix on the left is transformed into the class matrix on the right by the function lof(). The resulting left-ordered matrix was generated from a Chinese restaurant process (CRP) with α = 10.

(b) A left-ordered feature matrix. This matrix was generated by the Indian buffet process (IBP) with α = 10.

Indian buffet process

"Many Indian restaurants in London offer lunchtime buffets with an apparently infinite number of dishes."

- The first customer starts at the left of the buffet, and takes a serving from each dish, stopping after a Poisson(α) number of dishes as her plate becomes overburdened.
- The $i$th customer moves along the buffet, sampling dishes in proportion to their popularity, serving himself with probability $m_k/i$, and trying a Poisson($\alpha/i$) number of new dishes.
- The customer-dish matrix is our feature matrix, $Z$.
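A minimal sketch of the IBP generative process just described: customer $i$ takes each previously sampled dish $k$ with probability $m_k/i$ and then tries a Poisson($\alpha/i$) number of new dishes.

```python
import numpy as np

def ibp_sample(n_customers, alpha, rng=np.random.default_rng()):
    dish_counts = []                                 # m_k: customers who have taken dish k so far
    rows = []
    for i in range(1, n_customers + 1):
        row = [int(rng.random() < m / i) for m in dish_counts]    # existing dishes
        for k, taken in enumerate(row):
            dish_counts[k] += taken
        n_new = rng.poisson(alpha / i)                            # new dishes for this customer
        row.extend([1] * n_new)
        dish_counts.extend([1] * n_new)
        rows.append(row)
    Z = np.zeros((n_customers, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = ibp_sample(n_customers=10, alpha=3.0)
print(Z)    # a binary feature matrix; row sums can exceed one
```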

Conclusions

- We need flexible priors so that our Bayesian models are not based on unreasonable assumptions.
- Non-parametric models provide a way of defining flexible models.
- Many non-parametric models can be derived by starting from finite parametric models and taking the limit as the number of parameters goes to infinity.
- We've reviewed Gaussian processes, Dirichlet processes, and several other processes that can be used as a basis for defining non-parametric models.
- There are many open questions:
  - theoretical issues (e.g. consistency)
  - new models
  - applications
  - efficient samplers
  - approximate inference methods

http://www.gatsby.ucl.ac.uk/~zoubin (for more resources, also to contact me if interested in a PhD or postdoc)

Thanks for your patience!

Selected References

Gaussian Processes:
- O'Hagan, A. (1978). Curve Fitting and Optimal Design for Prediction (with discussion). Journal of the Royal Statistical Society B, 40(1):1-42.
- MacKay, D.J.C. (1997). Introduction to Gaussian Processes. y/gpB.pdf
- Neal, R.M. (1998). Regression and classification using Gaussian process priors (with discussion). In Bernardo, J.M. et al., editors, Bayesian Statistics 6, pages 475-501. Oxford University Press.
- Rasmussen, C.E. and Williams, C.K.I. (to be published). Gaussian Processes for Machine Learning.

Dirichlet Processes, Chinese Restaurant Processes, and related work:
- Ferguson, T. (1973). A Bayesian Analysis of Some Nonparametric Problems. Annals of Statistics, 1(2):209-230.
- Blackwell, D. and MacQueen, J. (1973). Ferguson Distributions via Polya Urn Schemes. Annals of Statistics, 1:353-355.
- Aldous, D. (1985). Exchangeability and Related Topics. In Ecole d'Ete de Probabilites de Saint-Flour XIII 1983, Springer, Berlin, pp. 1-198.
- Sethuraman, J. (1994). A Constructive Definition of Dirichlet Priors. Statistica Sinica, 4:639-650.
- Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25:855-900.

- Ishwaran, H. and Zarepour, M. (2000). Markov chain Monte Carlo in approximate Dirichlet and beta two-parameter process hierarchical models. Biometrika, 87(2):371-390.

Polya Trees:
- Ferguson, T.S. (1974). Prior Distributions on Spaces of Probability Measures. Annals of Statistics, 2:615-629.
- Lavine, M. (1992). Some aspects of Polya tree distributions for statistical modeling. Annals of Statistics, 20:1222-1235.

Hierarchical Dirichlet Processes and Infinite Hidden Markov Models:
- Beal, M.J., Ghahramani, Z., and Rasmussen, C.E. (2002). The Infinite Hidden Markov Model. In T.G. Dietterich, S. Becker, and Z. Ghahramani (eds.), Advances in Neural Information Processing Systems 14, Cambridge, MA: MIT Press, pp. 577-584.
- Teh, Y.W., Jordan, M.I., Beal, M.J., and Blei, D.M. (2004). Hierarchical Dirichlet Processes. Technical Report, UC Berkeley.

Dirichlet Process Mixtures:
- Antoniak, C.E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2:1152-1174.
- Escobar, M.D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577-588.
- Neal, R.M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249-265.
- Rasmussen, C.E. (2000). The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems 12. Cambridge, MA: MIT Press.

- Blei, D.M. and Jordan, M.I. (2005). Variational methods for Dirichlet process mixtures. Bayesian Analysis.
- Minka, T.P. and Ghahramani, Z. (2003). Expectation propagation for infinite mixtures. NIPS'03 Workshop on Nonparametric Bayesian Methods and Infinite Models.
- Heller, K.A. and Ghahramani, Z. (2005). Bayesian Hierarchical Clustering. Twenty-Second International Conference on Machine Learning (ICML-2005).

Dirichlet Diffusion Trees:
- Neal, R.M. (2003). Density modeling and clustering using Dirichlet diffusion trees. In J.M. Bernardo et al. (editors), Bayesian Statistics 7.

Indian Buffet Processes:
- Griffiths, T.L. and Ghahramani, Z. (2005). Infinite latent feature models and the Indian Buffet Process. Gatsby Computational Neuroscience Unit Technical Report GCNU-TR 2005-001.

Other:
- Müller, P. and Quintana, F.A. (2003). Nonparametric Bayesian Data Analysis.

