
CSE 427, Autumn 2021: MLE, EM

Outline
- HW #1 Discussion
- MLE: Maximum Likelihood Estimators
- EM: the Expectation-Maximization Algorithm
- Next: Motif description & discovery

HW #1 Discussion

#   Species                                 Name           Description                                                      Accession   Score to #1
1   Homo sapiens (Human)                    MYOD1_HUMAN    Myoblast determination protein 1                                 P15172      1709
2   Homo sapiens (Human)                    TAL1_HUMAN     T-cell acute lymphocytic leukemia protein 1 (TAL-1)              P17542      143
3   Mus musculus (Mouse)                    MYOD1_MOUSE    Myoblast determination protein 1                                 P10085      1500
4   Gallus gallus (Chicken)                 MYOD1_CHICK    Myoblast determination protein 1 homolog (MYOD1 homolog)         P16075      1020
5   Xenopus laevis (African clawed frog)    MYODA_XENLA    Myoblast determination protein 1 homolog A (Myogenic factor 1)   P13904      978
6   Danio rerio (Zebrafish)                 MYOD1_DANRE    Myoblast determination protein 1 homolog (Myogenic factor 1)     Q90477      893
7   Branchiostoma belcheri (Amphioxus)      Q8IU24_BRABE   MyoD-related                                                     Q8IU24      428
8   Drosophila melanogaster (Fruit fly)     MYOD_DROME     Myogenic-determination protein (Protein nautilus) (dMyd)         P22816      368
9   Caenorhabditis elegans                  LIN32_CAEEL    Protein lin-32 (Abnormal cell lineage protein 32)                Q10574      118
10  Homo sapiens (Human)                    SYFM_HUMAN     Phenylalanyl-tRNA synthetase, mitochondrial                      O95363      56

[Figure: PDB structure 1MDY, biological assembly 1]

Permutation Score Histogram vs Gaussian
[Figure: histogram of permuted-alignment scores (x-axis: score, y-axis: frequency), with the true score at 7.9 σ and the maximum permuted score at 5.7 σ]
Red curve is an approximate fit of the EVD to the score histogram; the fit looks better, especially in the tail. The max permuted score has probability ~10^-4, about what you'd expect in 2x10^4 trials. The true score is still moderately unlikely, about one tenth of the above.

Full pairwise score table, reordered. Species: hs (human), mm (mouse), gg (chicken), clawed frog, bb (amphioxus), fly, elegans.

Take Home
- Recognizable similarity in protein sequences over great evolutionary distances
- S-W p-values allow useful quantification
- Similarity correlates better to "same function" than to "same species"

Learning From Data: MLE (Maximum Likelihood Estimators)

Parameter Estimation
Given: independent samples x1, x2, ..., xn from a parametric distribution f(x | θ). (Not formally "conditional probability," but the notation is convenient.) Goal: estimate θ.
E.g.: Given the sample HHTTTTTHTHTTTHH of (possibly biased) coin flips, estimate θ = probability of Heads; here f(x | θ) is the Bernoulli probability mass function with parameter θ.

Likelihood (For Discrete Distributions)
P(x | θ): probability of event x given model θ.
Viewed as a function of x (fixed θ), it's a probability: e.g., Σ_x P(x | θ) = 1.
Viewed as a function of θ (fixed x), it's called the likelihood: e.g., Σ_θ P(x | θ) can be anything; relative values are the focus.
E.g., if θ = prob of heads in a sequence of coin flips, then P(HHTHH | 0.6) > P(HHTHH | 0.5), i.e., the event HHTHH is more likely when θ = 0.6 than when θ = 0.5. And what θ makes HHTHH most likely?

P(HHTHH | θ): probability of HHTHH, given P(H) = θ.
[Figure: plot of P(HHTHH | θ) = θ^4 (1-θ) for θ in [0, 1]; the curve peaks at its max near θ = 0.8]
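As a concrete check (a tiny Python sketch of my own, not from the slides), the likelihood of HHTHH is θ^4(1-θ), which peaks at θ = 0.8:

```python
import numpy as np

# Likelihood of observing HHTHH as a function of theta = P(H):
# four heads and one tail, so L(theta) = theta^4 * (1 - theta).
theta = np.linspace(0.0, 1.0, 1001)
likelihood = theta**4 * (1.0 - theta)

best = theta[np.argmax(likelihood)]
print(f"theta maximizing P(HHTHH | theta): {best:.2f}")              # ~0.80
print(f"L(0.6) = {0.6**4 * 0.4:.4f} > L(0.5) = {0.5**4 * 0.5:.4f}")
```

Calculus gives the same answer: d/dθ [θ^4(1-θ)] = 4θ^3 - 5θ^4 = 0 at θ = 4/5.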

Maximum Likelihood Parameter Estimation (For Discrete Distributions)
One (of many) approaches to parameter estimation: the likelihood of (independent) observations x1, x2, ..., xn is

L(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)   (*)

As a function of θ, what θ maximizes the likelihood of the data actually observed? Typical approach: solve

\frac{\partial}{\partial\theta} L(\vec{x} \mid \theta) = 0 \quad \text{or} \quad \frac{\partial}{\partial\theta} \log L(\vec{x} \mid \theta) = 0

(*) In general, the (discrete) likelihood is the joint pmf; the product form follows from independence.

Example 1
n independent coin flips, x1, x2, ..., xn; n0 tails, n1 heads, n0 + n1 = n; θ = probability of heads.

L(\vec{x} \mid \theta) = (1-\theta)^{n_0}\,\theta^{n_1}, \qquad
\log L(\vec{x} \mid \theta) = n_0 \log(1-\theta) + n_1 \log\theta

\frac{d}{d\theta} \log L = -\frac{n_0}{1-\theta} + \frac{n_1}{\theta} = 0
\;\Rightarrow\; \hat\theta = \frac{n_1}{n}

The observed fraction of successes in the sample is the MLE of the success probability in the population. (Also verify it's a max, not a min, and not better on the boundary.)

Likelihood (For Continuous Distributions)
Pr(any specific xi) = 0, so "likelihood = probability" won't work. Defn: the "likelihood" of x1, ..., xn is their joint density, which (by independence) is the product of their marginal densities. (As usual, swap density for pmf.) Why is this sensible?
a) density captures all that matters: relative likelihood
b) desirable property: better model fit increases the likelihood
c) if the density at x is f(x), then for any small δ > 0 the probability of a sample within δ/2 of x is ≈ δ·f(x), so density really is capturing probability; and δ is constant wrt θ, so it just drops out of d/dθ log L(...) = 0.
Otherwise, MLE is just like the discrete case: get the likelihood and solve ∂/∂θ log L(x⃗ | θ) = 0.

Parameter Estimation
Given: independent samples x1, x2, ..., xn from a parametric distribution f(x | θ), estimate θ.
E.g.: Given n normal samples, estimate the mean and variance:

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2 / (2\sigma^2)}, \qquad \theta = (\mu, \sigma^2)

Ex: I got data; a little birdie tells me it's normal, and promises σ² = 1.
[Figure: the observed data points marked along the x-axis]

Which is more likely: (a) this? μ unknown, σ² = 1
[Figure: a candidate N(μ, 1) density curve drawn over the observed data]

Which is more likely: (b) or this? μ unknown, σ² = 1
[Figure: another candidate N(μ, 1) density curve over the same data]

Which is more likely: (c) or this? μ unknown, σ² = 1
[Figure: a third candidate N(μ, 1) density curve, roughly centered on the data]

Which is more likely: (c) or this? μ unknown, σ² = 1
Looks good by eye, but how do I optimize my estimate of μ?
[Figure: the same curve as (c)]

Ex: xi ~ N(μ, σ²), σ² = 1, μ unknown.

L(x_1, \ldots, x_n \mid \mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-(x_i-\mu)^2/2}   \text{(product of densities)}

\ln L(x_1, \ldots, x_n \mid \mu) = \sum_{i=1}^{n} \left( -\tfrac{1}{2}\ln(2\pi) - \tfrac{(x_i-\mu)^2}{2} \right)

\frac{d}{d\mu} \ln L = \sum_{i=1}^{n} (x_i - \mu) = \left(\sum_{i=1}^{n} x_i\right) - n\mu = 0
\;\Rightarrow\; \hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}

(And verify it's a max, not a min, and not better on the boundary.) The sample mean is the MLE of the population mean.
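A quick numeric sanity check of this result (my own sketch; the sample values are made up): the log-likelihood under N(μ, 1) is largest at the sample mean.

```python
import numpy as np
from scipy.stats import norm

x = np.array([-1.3, -0.4, 0.2, 0.7, 1.1, 1.9, 2.4])   # hypothetical sample

def log_lik(mu):
    # log-likelihood of the sample under N(mu, 1): a sum of log densities
    return np.sum(norm.logpdf(x, loc=mu, scale=1.0))

mle = x.mean()
for mu in (mle - 0.5, mle, mle + 0.5):
    print(f"mu = {mu:6.3f}   log L = {log_lik(mu):8.3f}")
# The middle line (the sample mean) gives the largest log-likelihood.
```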

Ex: I got data; a little birdie tells me it's normal (but does not tell me μ, σ²).
[Figure: the observed data points marked along the x-axis]

Which is more likely: (a) this? μ, σ² both unknown
[Figure: a candidate normal density over the observed data]

Which is more likely: (b) or this? μ, σ² both unknown
[Figure: another candidate normal density, with a different mean and a larger σ]

Which is more likely: (c) or this? μ, σ² both unknown
[Figure: a third candidate normal density, roughly centered on the data with σ = 1]

Which is more likely: (d) or this? μ, σ² both unknown
[Figure: a fourth candidate normal density, centered on the data with σ = 0.5]

Which is more likely: (d) or this? μ, σ² both unknown
Looks good by eye, but how do I optimize my estimates of μ & σ²?
[Figure: the same curve as (d)]

Ex 3: xi ~ N(μ, σ²), μ and σ² both unknown; write θ1 = μ, θ2 = σ².

\ln L(x_1, \ldots, x_n \mid \theta_1, \theta_2) = \sum_{i=1}^{n} \left( -\tfrac{1}{2}\ln(2\pi\theta_2) - \tfrac{(x_i-\theta_1)^2}{2\theta_2} \right)

\frac{\partial}{\partial\theta_1} \ln L = \sum_{i=1}^{n} \frac{x_i - \theta_1}{\theta_2} = 0
\;\Rightarrow\; \hat\theta_1 = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}

The sample mean is the MLE of the population mean, again.
[Figure: the likelihood surface over (θ1, θ2)]
In general, a problem like this results in 2 equations in 2 unknowns. Easy in this case, since θ2 drops out of the ∂/∂θ1 = 0 equation.

Ex. 3, (cont.)

\frac{\partial}{\partial\theta_2} \ln L = \sum_{i=1}^{n} \left( -\frac{1}{2\theta_2} + \frac{(x_i-\theta_1)^2}{2\theta_2^2} \right) = 0
\;\Rightarrow\; \hat\theta_2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat\theta_1)^2 = \bar{s}^2

The sample variance is the MLE of the population variance.

Ex. 3, (cont.)
The same derivation as above, but note: \hat\theta_2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat\theta_1)^2 is a consistent, but biased, estimate of the population variance. (An example of overfitting.) The unbiased estimate is

\hat\theta_2' = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \hat\theta_1)^2

i.e., the MLE is correct only in the limit n → ∞. Moral: MLE is a great idea, but not a magic bullet.
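To see the bias concretely, here is a small simulation (my own sketch, not course code): averaged over many samples of size n = 5, the MLE variance comes out near (n-1)/n = 0.8 when the true σ² is 1, while the n-1 version does not.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 5, 100_000

samples = rng.normal(loc=0.0, scale=1.0, size=(trials, n))   # true sigma^2 = 1
mle_var = samples.var(axis=1, ddof=0)        # divide by n   (the MLE)
unbiased_var = samples.var(axis=1, ddof=1)   # divide by n-1 (unbiased)

print(f"average MLE variance:      {mle_var.mean():.3f}   (true value 1.0)")
print(f"average unbiased variance: {unbiased_var.mean():.3f}")
```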

MLE Summary
- MLE is one way to estimate parameters from data.
- You choose the form of the model (normal, binomial, ...); the math chooses the value(s) of the parameter(s).
- Defining the "likelihood function" (based on the pmf or pdf of the model) is often the critical step; the math/algorithms to optimize it are generic. Often simply (d/dθ)(log Likelihood(data | θ)) = 0.
- Has the intuitively appealing property that the parameters maximize the likelihood of the observed data; basically it just assumes your sample is "representative." Of course, unusual samples will give bad estimates (estimate normal human heights from a sample of NBA stars?), but that is an unlikely event.
- Often, but not always, MLE has other desirable properties, like being unbiased, or at least consistent.
[Figure: the likelihood surface over (θ1, θ2), repeated]

Conditional Probability & Bayes Rule

Conditional Probability
Conditional probability of E given F: the probability that E occurs given that F has occurred ("conditioning on F"). Written P(E | F); means "P(E has happened, given F observed)."

P(E \mid F) = \frac{P(EF)}{P(F)}, \qquad \text{where } P(F) > 0

[Figure: Venn diagram of events E and F in the sample space S]

Law of Total Probability
E and F are events in the sample space S.
E = EF ∪ EF^c, so P(E) = P(EF) + P(EF^c).
[Figure: Venn diagram showing E split into EF and EF^c]

Bayes Theorem
Most common form:

P(E \mid F) = \frac{P(F \mid E)\, P(E)}{P(F)}

Expanded form (using the law of total probability):

P(E \mid F) = \frac{P(F \mid E)\, P(E)}{P(F \mid E)\, P(E) + P(F \mid E^c)\, P(E^c)}

Proof: both follow from the definition of conditional probability, since P(E | F) P(F) = P(EF) = P(F | E) P(E), and the law of total probability expands P(F).
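A tiny numeric example (mine, not from the slides) that also previews the EM E-step below: two coins, one fair and one with P(H) = 0.8, each chosen with probability 1/2. After observing 8 heads in 10 flips, Bayes rule gives the posterior probability that the biased coin was used.

```python
from scipy.stats import binom

p_biased, p_fair = 0.8, 0.5
prior = 0.5                    # each coin equally likely a priori
x, flips = 8, 10               # observed: 8 heads in 10 flips

like_biased = binom.pmf(x, flips, p_biased)
like_fair = binom.pmf(x, flips, p_fair)

# Bayes rule, expanded form: P(biased | data)
posterior = like_biased * prior / (like_biased * prior + like_fair * prior)
print(f"P(biased coin | 8 heads in 10 flips) = {posterior:.3f}")
```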

The "EM" AlgorithmThe Expectation-Maximization Algorith(for a Two-Component Mixture)m37

Previously: How to estimate μ given data.
For this problem we got a nice, closed-form solution, allowing calculation of the μ, σ that maximize the likelihood of the observed data. We're not always so lucky.
[Figure: an N(μ, 1) density fit to the observed data]

More Complex Example
This? Or this?
(A modeling decision, not a math problem; but if the latter, what math?)
[Figure: the same data fit by a single component vs. by a two-component mixture]

A Living Histogram
[Photo:] male and female genetics students, University of Connecticut in 1996
http://mindprod.com/jgloss/histogram.html

2 Coins: A Binomial Mixture
One fair coin (p(H) = 1/2), and one biased coin (p(H) = p, fixed but unknown).
For i = 1, 2, ..., n: pick a coin at random, flip it 10 times, and record xi = # of heads.
[Figure: the expected histogram of the xi (two overlapping binomial bumps)]
What is the MLE for p?

EM as Chicken vs Egg
Hidden data: let zi = 1 if xi was from the biased coin, else 0.
IF I knew the zi, I could estimate p (easy: just use the xi s.t. zi = 1).
IF I knew p, I could estimate the zi (e.g., if p = .8, xi = 8 implies zi is more likely 1; xi = 5 implies zi is more likely 0; not clear-cut in between, but the uncertainty is quantifiable).
Sadly, I know neither, but (be optimistic!) iterate:
The "E-M Algorithm": repeat until convergence
E-step: given (estimated) p, (re)-estimate the z's
M-step: given (estimated) z's, (re)-estimate p

The E-Step
E[z] = 0·P(z = 0) + 1·P(z = 1) = P(z = 1), computed by Bayes rule.
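Written out for this two-coin setup (my own derivation from the setup above, which picks each coin with probability 1/2), the E-step is just Bayes rule:

E[z_i] = P(z_i = 1 \mid x_i, p)
       = \frac{\tfrac{1}{2}\, b(x_i \mid p)}{\tfrac{1}{2}\, b(x_i \mid p) + \tfrac{1}{2}\, b(x_i \mid \tfrac{1}{2})}
       = \frac{b(x_i \mid p)}{b(x_i \mid p) + b(x_i \mid \tfrac{1}{2})},
\qquad b(x \mid q) = \binom{10}{x} q^x (1-q)^{10-x}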

Math-Hacking the "if"
Let b(x | p) = the binomial probability of x heads in 10 flips when p(H) = p. As above, z = 1 if x was from the biased coin, else 0. Then the likelihood of x is
L(x, z | p) = "if z = 1 then b(x | p) else b(x | ½)"
Is there a smoother way? Especially, a differentiable way? Yes!
Idea #1: L(x, z | p) = z·b(x | p) + (1 - z)·b(x | ½)
Better still, idea #2: L(x, z | p) = b(x | p)^z · b(x | ½)^(1-z)
All are equal when z is 0/1; idea #2 is nicer because its log, z·log b(x | p) + (1 - z)·log b(x | ½), is linear in z, so taking expectations over z is easy.

The M-Step
Find the p maximizing E[log L(x⃗, z⃗ | p)], where (by linearity of expectation) each zi in the log-likelihood is replaced by E[zi].
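Spelling out that algebra (my own derivation, using the log of "idea #2" from the previous slide and linearity of expectation; it assumes 10 flips per draw):

E[\log L(\vec{x}, \vec{z} \mid p)]
  = \sum_{i=1}^{n} \Big( E[z_i] \log b(x_i \mid p) + (1 - E[z_i]) \log b(x_i \mid \tfrac{1}{2}) \Big)

Only the first term depends on p, and \log b(x \mid p) = \text{const} + x \log p + (10 - x)\log(1-p), so

\frac{d}{dp} E[\log L]
  = \sum_{i=1}^{n} E[z_i] \left( \frac{x_i}{p} - \frac{10 - x_i}{1 - p} \right) = 0
\;\Rightarrow\;
\hat{p} = \frac{\sum_i E[z_i]\, x_i}{10 \sum_i E[z_i]}

i.e., the fraction of heads among all flips, weighted by the probability that each draw came from the biased coin.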

An E-M "Alg" for This ProblemInput: x1, xn, 0 xi 1Maybe randomly restarthere a few timesGuessrepeat until convergenceE-step: calculate E[z1], E[zn], using latestM-step: update p using E[zi] (the p that maximizeslikelihood given those E[zi]Alternatively: start by guessing E[zi], then do M-E-M- p)0:p46

Suggested Exercise(s)
- Redo the math assuming both coins are biased (but unequally).
- Write code to implement either version. Or a spreadsheet, with "fill down" to do a few iterations.
- Even in the 1-coin-biased version, there may be multiple local maxima (e.g., consider a histogram with a small peak at .25 and large ones at .5 & .8). Does your alg get stuck at a local max? How often? Does random restart pragmatically fix this?

EM for a Gaussian Mixture
I have presented the Gaussian mixture example in other courses. I will NOT lecture on it in 427, but I'll leave the slides (48-60) here in case you are interested in seeing another example in detail. Happy to discuss in office hours.

Gaussian Mixture Models / Model-based Clustering
Parameters θ: means μ1, μ2; variances σ1², σ2²; mixing parameters τ1, τ2 = 1 - τ1.
P.D.F. (components separately, and together):

f(x \mid \mu_1, \sigma_1^2), \quad f(x \mid \mu_2, \sigma_2^2), \quad \tau_1 f(x \mid \mu_1, \sigma_1^2) + \tau_2 f(x \mid \mu_2, \sigma_2^2)

Likelihood:

L(x_1, x_2, \ldots, x_n \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \tau_1, \tau_2) = \prod_{i=1}^{n} \sum_{j=1}^{2} \tau_j f(x_i \mid \mu_j, \sigma_j^2)

No closed-form max.

Likelihood Surface
[Figure: 3D plot of the mixture likelihood as a function of (μ1, μ2)]

[Figure: the likelihood surface for the sample xi = -10.2, -10, -9.8, -0.2, 0, 0.2, 11.8, 12, 12.2 with σ² = 1.0, τ1 = τ2 = .5; it has local maxima near (μ1, μ2) ≈ (-10, 6), (-5, 12), (6, -10), and (12, -5)]

A What-If Puzzle
Messy: no closed-form solution is known for finding the θ maximizing L. But what if we knew the hidden data?

EM as Egg vs Chicken
IF the parameters θ were known, we could estimate the zij: e.g., if |xi - μ1|/σ1 ≪ |xi - μ2|/σ2, then P[zi1 = 1] ≫ P[zi2 = 1].
IF the zij were known, we could estimate the parameters θ: e.g., only points in cluster 2 influence μ2, σ2.
But we know neither; (optimistically!) iterate:
E-step: calculate the expected zij, given the parameters
M-step: calculate the "MLE" of the parameters, given the E(zij)
Overall, a clever "hill-climbing" strategy.

Simple Version: "Classification EM"
(Sidebar: not "EM" proper, but it may help clarify the concepts.)
If E[zij] < .5, pretend zij = 0; if E[zij] > .5, pretend it's 1. I.e., classify each point as component 1 or 2.
Now recalc θ, assuming that partition (standard MLE).
Then recalc the E[zij], assuming that θ.
Then re-recalc θ, assuming the new E[zij], etc., etc. This is "K-means clustering," essentially.
"Full EM" is slightly more involved (to account for the uncertainty in classification), but this is the crux.
Another contrast: HMM parameter estimation via "Viterbi" vs "Baum-Welch" training. In both, the "hidden data" is "which state was it in at each step?" Viterbi is like the E-step in classification EM: it makes a single state prediction. B-W is full EM: it captures the uncertainty in the state prediction, too. For either, the M-step maximizes the HMM emission/transition probabilities, assuming those fixed states (Viterbi) / uncertain states (B-W).

Full EM
I.e., average over the possible, but hidden, zij's.

The E-step: Find E(zij), i.e., P(zij = 1)
Assume θ is known & fixed.
Let A (resp. B) be the event that xi was drawn from f1 (resp. f2), and let D be the observed datum xi. The expected value of zi1 is P(A | D):

E[z_{i1}] = P(A \mid D) = \frac{P(D \mid A)\, P(A)}{P(D)} = \frac{\tau_1 f(x_i \mid \mu_1, \sigma_1^2)}{\tau_1 f(x_i \mid \mu_1, \sigma_1^2) + \tau_2 f(x_i \mid \mu_2, \sigma_2^2)}

Repeat for each zi,j.
Note: the denominator is the sum of the numerators, i.e., that which normalizes the sum to 1 (typical Bayes).

Complete Data Likelihood
If the hidden zij were observed, the likelihood of (xi, zi1, zi2) would be
L(x_i, z_{i1}, z_{i2} \mid \theta) = "if z_{i1} = 1 then \tau_1 f(x_i \mid \mu_1, \sigma_1^2) else \tau_2 f(x_i \mid \mu_2, \sigma_2^2)"
(Better):
L(x_i, z_{i1}, z_{i2} \mid \theta) = \big(\tau_1 f(x_i \mid \mu_1, \sigma_1^2)\big)^{z_{i1}} \cdot \big(\tau_2 f(x_i \mid \mu_2, \sigma_2^2)\big)^{z_{i2}}
Equal, if the zij are 0/1.

M-step: Find the θ maximizing E(log(Likelihood)), where the expectation is taken wrt the distribution of the zij.
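The slide's derivation isn't reproduced in this transcription, but the standard result (which matches the μ update on the next slide) is: with

E[\log L] = \sum_{i=1}^{n} \sum_{j=1}^{2} E[z_{ij}] \big( \log \tau_j + \log f(x_i \mid \mu_j, \sigma_j^2) \big)

maximizing term by term (with the constraint τ1 + τ2 = 1) gives the weighted updates

\mu_j = \frac{\sum_i E[z_{ij}]\, x_i}{\sum_i E[z_{ij}]}, \qquad
\sigma_j^2 = \frac{\sum_i E[z_{ij}]\, (x_i - \mu_j)^2}{\sum_i E[z_{ij}]}, \qquad
\tau_j = \frac{1}{n} \sum_i E[z_{ij}]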

M-step: calculating the μ's

\mu_j = \frac{\sum_{i=1}^{n} E[z_{ij}]\, x_i}{\sum_{i=1}^{n} E[z_{ij}]}

In words: μj is the average of the observed xi's, weighted by the probability that xi was sampled from component j.
[Table: a small worked example (old E[zij]'s, per-component row sums, and the new μ's); the numbers are not cleanly recoverable from this transcription]

2-Component Mixture: σ1 = σ2 = 1; τ = 0.5
Essentially converged in 2 iterations. (Excel spreadsheet on the course web.)
[Figure: iteration-by-iteration parameter estimates from the spreadsheet]
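A compact Python sketch of this two-component EM (my own illustration, not the course spreadsheet; it fixes σ1 = σ2 = 1 and τ1 = τ2 = 0.5 as on the slide, and reuses the nine data points from the likelihood-surface example):

```python
import numpy as np
from scipy.stats import norm

x = np.array([-10.2, -10.0, -9.8, -0.2, 0.0, 0.2, 11.8, 12.0, 12.2])

tau = 0.5              # fixed, equal mixing weights
sigma = 1.0            # fixed, common standard deviation
mu1, mu2 = -1.0, 1.0   # arbitrary initial guesses

for it in range(100):
    # E-step: posterior probability that each point came from component 1.
    p1 = tau * norm.pdf(x, mu1, sigma)
    p2 = (1 - tau) * norm.pdf(x, mu2, sigma)
    e1 = p1 / (p1 + p2)          # E[z_i1]
    e2 = 1.0 - e1                # E[z_i2]

    # M-step: each mean is the E[z]-weighted average of the data.
    mu1_new = np.sum(e1 * x) / np.sum(e1)
    mu2_new = np.sum(e2 * x) / np.sum(e2)
    converged = max(abs(mu1_new - mu1), abs(mu2_new - mu2)) < 1e-9
    mu1, mu2 = mu1_new, mu2_new
    if converged:
        break

print(f"stopped after {it + 1} iterations: mu1 = {mu1:.2f}, mu2 = {mu2:.2f}")
```

Different initial guesses can land in different bumps of the four-peaked likelihood surface shown earlier; comparing final likelihoods across random restarts is the pragmatic fix.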

EM Summary
Fundamentally, maximum likelihood parameter estimation; broader than just these examples.
Useful if there is 0/1 hidden data, and if the analysis would be more tractable if the 0/1 hidden data z were known.
Iterate:
E-step: estimate E[z] for each z, given θ (Bayes)
M-step: estimate the θ maximizing E[log likelihood], given the E[z] (MLE) [where "E[log L]" is wrt the random z; E[z] = p(z = 1)]

EM Issues
Under mild assumptions (e.g., DEKM sect. 11.6), EM is guaranteed to increase the likelihood with every E-M iteration, hence will converge.
But it may converge to a local, not global, max. (Recall the 4-bump surface.)
The issue is intrinsic (probably), since EM is often applied to NP-hard problems (including clustering, above, and motif discovery, soon).
Nevertheless, it is widely used and often effective, esp. with random restarts.

