MINIMAX ESTIMATION WITH THRESHOLDING AND ITS APPLICATION TO WAVELET ANALYSIS


Harrison H. Zhou* and J. T. Gene Hwang**

Cornell University

May 1, 2003

*Also known as Huibin Zhou.
**Also known as Jiunn T. Hwang.

Abstract. Many statistical practices involve selecting a model (a reduced model from the full model) and then using it for estimation, possibly with thresholding. Is it possible to do so and still come up with an estimator that is always better than the naive estimator based on no model selection? The James–Stein estimator allows us to do so. However, the James–Stein estimator considers only one reduced model, the origin. What would be more desirable is to select a data-chosen reduced model (of an arbitrary dimension) and then do estimation with possible thresholding. In this paper, we construct such estimators and apply them to wavelet analysis. In finite sample settings, these estimators are minimax and perform best among the well-known estimators that attempt model selection and estimation at the same time. Some of our estimators are also shown to be asymptotically optimal.

Key words and phrases: James–Stein estimator, model selection, VisuShrink, SureShrink, BlockJS.

AMS 2000 Subject Classification: Primary 62G05, 62J07; Secondary 62C10, 62H25.

1. Introduction.

In virtually all statistical activities, one constructs a model to summarize the data. Not only can the model provide a good and effective way of summarizing the data; the model, if correct, often provides more accurate prediction. This point has been argued forcefully in Gauch (1993). Is there a way to use the data to

select a reduced model so that, if the reduced model is correct, the model-based estimator improves on the naive estimator (constructed using the full model) and yet never does worse than the naive estimator even if the full model is actually the only correct model? James–Stein estimation (1961) provides such a striking result under the normality assumption. Any estimator, such as the James–Stein estimator, that does no worse than the naive estimator is said to be minimax. See the precise discussion right before Lemma 1 of Section 2. The problem with the James–Stein positive part estimator, however, is that it selects only between two models: the origin and the full model. It is possible to construct estimators similar to the James–Stein positive part estimator that select between the full model and another linear subspace; however, such an estimator always chooses between the two. The nice idea of George (1986a,b) in multiple shrinkage does allow the data to choose among several models; it does not, however, do thresholding, which is the aim of this paper.

In many applications, the wavelet model is a very important model in statistics. To use the model, one must select among the full model and models of smaller dimension in which some of the wavelet coefficients are zero. Is there a way to select a reduced model so that the estimator based on it does no worse in any case than the naive estimator based on the full model, but improves substantially upon the naive estimator when the reduced model is correct? Again, the James–Stein estimator provides such a solution; however, it selects either the origin or the full model. Furthermore, the ideal estimator should do thresholding: namely, it should truncate the components that are small and preserve (or shrink) the other components. However, to the best knowledge of the authors, no such minimax estimators have been constructed. In this paper, we provide minimax estimators

which perform thresholding simultaneously.

Sections 1 through 3 develop the new estimator for the canonical form of the model by solving Stein's differential inequality. Sections 4 and 5 provide an approximate Bayesian justification and an empirical Bayes interpretation. Sections 7 and 8 apply the result to wavelet analysis. The proposed method outperforms several prominent procedures in the statistical wavelet literature.

2. New Estimators for a Canonical Model.

In this section, we consider the canonical form of the multinormal mean estimation problem under squared error loss. Hence we assume that our observation

$$Z = (Z_1, \dots, Z_d) \sim N(\theta, I)$$

is a $d$-dimensional vector of normal random variables with mean $\theta = (\theta_1, \dots, \theta_d)$ and known identity covariance matrix $I$. The case where the variance of $Z_i$ is unknown will be discussed in Section 7.

The connection of this problem with wavelet analysis will be pointed out in Sections 7 and 8. In short, $Z_i$ and $\theta_i$ represent the wavelet coefficients of the data and of the true curve at the same resolution, respectively, and $d$ is the dimension of a resolution. For now, we seek an estimator of $\theta$ based on $Z$. We shall, without loss of generality, consider an estimator of the form $\delta(Z) = (\delta_1(Z), \dots, \delta_d(Z))$, where

$$\delta_i(Z) = Z_i - g_i(Z),$$

with $g(Z) : \mathbb{R}^d \to \mathbb{R}^d$, and search for $g(Z) = (g_1(Z), \dots, g_d(Z))$. To ensure that the new estimator (perhaps with some thresholding) does better than $Z$ (which does

no thresholding), we compare the risk of $\delta(Z)$ to that of $Z$ with respect to the $\ell_2$ norm, namely

$$E\|\delta(Z) - \theta\|^2 = E \sum_{i=1}^{d} (\delta_i(Z) - \theta_i)^2.$$

It is obvious that the risk of $Z$ is then $d$. We say that an estimator strictly dominates another if the former has a smaller risk for every $\theta$. We say that one dominates the other if the former has a risk no greater than the latter for every $\theta$, and a smaller risk for some $\theta$. Note that $Z$ is a minimax estimator, i.e., it minimizes $\sup_\theta E\|\delta'(Z) - \theta\|^2$ among all $\delta'(Z)$. Consequently, any $\delta(Z)$ that dominates $Z$ is also minimax.

To construct an estimator that dominates $Z$, we use the following lemma.

Lemma 1 (Stein 1981). Suppose that $g : \mathbb{R}^d \to \mathbb{R}^d$ is a measurable function with $g_i(\cdot)$ as its $i$th component, such that for every $i$, $g_i(\cdot)$ is almost differentiable with respect to the $i$th component. If

$$E\Big|\frac{\partial}{\partial Z_i} g_i(Z)\Big| < \infty, \quad \text{for } i = 1, \dots, d,$$

then

$$E_\theta \|Z - g(Z) - \theta\|^2 = E_\theta \{\, d - 2\,\nabla \cdot g(Z) + \|g(Z)\|^2 \,\},$$

where $\nabla \cdot g(Z) = \sum_{i=1}^{d} \partial g_i(Z)/\partial Z_i$. Hence if $g(Z)$ solves the differential inequality

$$2\,\nabla \cdot g(Z) - \|g(Z)\|^2 > 0, \tag{0}$$

the estimator $Z - g(Z)$ strictly dominates $Z$.

Remark: $g_i(z)$ is said to be almost differentiable with respect to $z_i$ if, for almost all $z_j$, $j \ne i$, $g_i(z)$ can be written as a one-dimensional integral of a function with
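As a numerical illustration (not part of the original argument), Stein's identity in Lemma 1 can be checked by Monte Carlo for the James–Stein choice $g(Z) = aZ/\|Z\|^2$, for which $\nabla \cdot g(Z) = a(d-2)/\|Z\|^2$ and $\|g(Z)\|^2 = a^2/\|Z\|^2$. The following Python sketch estimates both sides of the identity from the same draws; the values of $d$, $\theta$, $a$, and the seed are illustrative choices of ours.

```python
import numpy as np

# Monte Carlo check of Stein's identity (Lemma 1) for g(Z) = a*Z/||Z||^2:
#   div g(Z) = a*(d - 2)/||Z||^2   and   ||g(Z)||^2 = a^2/||Z||^2,
# so  d - 2 div g + ||g||^2 = d - (2a(d-2) - a^2)/||Z||^2.
rng = np.random.default_rng(0)
d, n = 10, 200_000
theta = np.full(d, 0.5)          # illustrative mean vector
a = d - 2                        # classical James-Stein constant

Z = rng.standard_normal((n, d)) + theta        # Z ~ N(theta, I)
sq = np.sum(Z**2, axis=1)                      # ||Z||^2
g = a * Z / sq[:, None]

lhs = np.mean(np.sum((Z - g - theta)**2, axis=1))   # E ||Z - g(Z) - theta||^2
rhs = np.mean(d - (2*a*(d - 2) - a**2) / sq)        # E [ d - 2 div g + ||g||^2 ]
print(lhs, rhs)   # the two estimates agree up to Monte Carlo error
```

Since $2\,\nabla\cdot g - \|g\|^2 = (2a(d-2) - a^2)/\|Z\|^2 > 0$ for $0 < a < 2(d-2)$, the same run also exhibits a risk strictly below $d$.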

respect to $z_i$. For such $z_j$'s, $j \ne i$, using Berger's (1980) terminology, one calls $g_i(Z)$ absolutely continuous with respect to $z_i$.

To motivate the proposed estimator, note that the James–Stein positive part estimator has the form

$$\hat{\theta}_i^{JS+} = \Big(1 - \frac{a}{\|Z\|^2}\Big)_+ Z_i,$$

where $c_+ = \max(c, 0)$ for any number $c$. This estimator, however, truncates independently of the magnitude of $Z_i$; indeed, it truncates all or none of the coordinates. To construct an estimator that truncates only the coordinates with small $|Z_i|$'s, it seems necessary to replace $a$ by a decreasing function $h(|Z_i|)$ of $|Z_i|$ and consider

$$\hat{\theta}_i = \Big(1 - \frac{h(|Z_i|)}{D}\Big)_+ Z_i,$$

where $D$, independent of $i$, is yet to be determined. (In a somewhat different approach, Beran and Dümbgen (1998) construct a modulation estimator corresponding to a monotone shrinkage factor.) With such a form, $\hat{\theta}_i = 0$ if $h(|Z_i|) \ge D$, which has a better chance of being satisfied when $|Z_i|$ is small.

We consider a simple choice $h(|Z_i|) = a|Z_i|^{-2/3}$ and find a $D = \sum |Z_i|^{4/3}$ that solves the differential inequality (0). This leads to the untruncated version $\hat{\theta}$ with $i$th component

$$\hat{\theta}_i(Z) = Z_i - g_i(Z), \quad \text{where } g_i(Z) = a D^{-1}\,\mathrm{sign}(Z_i)\, |Z_i|^{1/3}. \tag{1}$$

Here and later, $\mathrm{sign}(Z_i)$ denotes the sign of $Z_i$. It is possible to use other decreasing functions $h(|Z_i|)$ and other $D$.

In general, we consider, for a fixed $\beta \le 2$, an estimator of the form

$$\hat{\theta}_i = Z_i - g_i(Z), \tag{2}$$

where

$$g_i(Z) = a D^{-1}\,\mathrm{sign}(Z_i)\, |Z_i|^{\beta-1} \quad \text{and} \quad D = \sum_{i=1}^{d} |Z_i|^{\beta}. \tag{3}$$

Although at first glance it may seem hard to justify this estimator, it is given a Bayesian and an empirical Bayes justification in Sections 4 and 5. It is also a class of estimators that includes, as the special case $\beta = 2$, the James–Stein estimator. Now we have

Theorem 2. For $d \ge 2$ and $1 < \beta \le 2$, $\hat{\theta}(Z)$ dominates $Z$ if and only if

$$0 < a \le 2(\beta - 1)\, \inf_\theta \frac{E_\theta\big( D^{-1} \sum_{i=1}^{d} |Z_i|^{\beta-2} \big)}{E_\theta\big( D^{-2} \sum_{i=1}^{d} |Z_i|^{2\beta-2} \big)} - 2\beta.$$

Proof: Obviously, for $Z_j \ne 0$, $j \ne i$, $g_i(Z)$ can be written as the one-dimensional integral, with respect to $Z_i$, of

$$\frac{\partial}{\partial Z_i} g_i(Z) = -a\beta\, D^{-2} |Z_i|^{2\beta-2} + a(\beta - 1)\, D^{-1} |Z_i|^{\beta-2}.$$

(The only concern is at $Z_i = 0$; consider only nonzero $Z_j$'s, $j \ne i$. Since $\beta > 1$, this function is integrable with respect to $Z_i$ even over an interval including zero.) It takes some effort to prove that $E\big| \frac{\partial}{\partial Z_i} g_i(Z) \big| < \infty$; one only needs to focus on $Z_j$ close to zero. Using the spherical-like transformation $r^2 = \sum |Z_i|^{\beta}$, we may show that if $d \ge 2$ and $\beta > 1$, both terms in the displayed expression above are integrable.

Now

$$\|g(Z)\|^2 = a^2 D^{-2} \sum_{i=1}^{d} |Z_i|^{2\beta-2}.$$

Hence

$$E_\theta \|Z - g(Z) - \theta\|^2 \le d, \quad \text{for every } \theta,$$
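As a quick sanity check (illustrative Python; the function name `g` and the test vector are ours), one can verify that the shrinkage term (3) with $\beta = 2$ reduces exactly to the James–Stein correction $a Z_i / \|Z\|^2$, as claimed above.

```python
import numpy as np

# The shrinkage term (3): g_i(Z) = a * sign(Z_i) * |Z_i|^(beta-1) / D,
# with D = sum_i |Z_i|^beta.  For beta = 2, sign(Z_i)*|Z_i| = Z_i and
# D = ||Z||^2, so g reduces to the James-Stein correction a*Z/||Z||^2.
def g(Z, a, beta):
    D = np.sum(np.abs(Z) ** beta)
    return a * np.sign(Z) * np.abs(Z) ** (beta - 1) / D

Z = np.array([1.5, -0.3, 0.02, 2.0, -1.1])   # illustrative data
a = 3.0
print(np.allclose(g(Z, a, 2.0), a * Z / np.sum(Z**2)))   # True
```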

if and only if

$$E_\theta \{\, 2\,\nabla \cdot g(Z) - \|g(Z)\|^2 \,\} \ge 0, \quad \text{for every } \theta,$$

i.e.,

$$E_\theta\Big( a \Big[ 2(\beta - 1)\, D^{-1} \sum_{i=1}^{d} |Z_i|^{\beta-2} - 2\beta\, D^{-2} \sum_{i=1}^{d} |Z_i|^{2\beta-2} \Big] - a^2 D^{-2} \sum_{i=1}^{d} |Z_i|^{2\beta-2} \Big) \ge 0, \quad \text{for every } \theta, \tag{4}$$

which is equivalent to the condition stated in the theorem. $\square$

Theorem 3. The estimator $\hat{\theta}(Z)$ with $i$th component given in (2) and (3) dominates $Z$ provided $0 < a \le 2(\beta - 1)d - 2\beta$ and $1 < \beta \le 2$.

Proof: By the correlation inequality,

$$d \sum_{i=1}^{d} |Z_i|^{2\beta-2} \le \Big( \sum_{i=1}^{d} |Z_i|^{\beta-2} \Big) \Big( \sum_{i=1}^{d} |Z_i|^{\beta} \Big).$$

Hence

$$\frac{E_\theta\big( D^{-1} \sum_{i=1}^{d} |Z_i|^{\beta-2} \big)}{E_\theta\big( D^{-2} \sum_{i=1}^{d} |Z_i|^{2\beta-2} \big)} \ge d.$$

Hence if $0 < a \le 2(\beta - 1)d - 2\beta$, the condition in Theorem 2 is satisfied, implying domination of $\hat{\theta}(Z)$ over $Z$. $\square$

The following theorem is a generalization of Theorem 6.2 on page 302 of Lehmann (1983) and Theorem 5.4 on page 356 of Lehmann and Casella (1998). It shows that taking the positive part improves the estimator componentwise. Specifically, for an estimator $(\tilde{\theta}_1(Z), \dots, \tilde{\theta}_d(Z))$ where

$$\tilde{\theta}_i(Z) = (1 - h_i(Z)) Z_i,$$

the positive part estimator of $\tilde{\theta}_i(Z)$ is denoted by

$$\tilde{\theta}_i^+(Z) = (1 - h_i(Z))_+ Z_i.$$

Theorem 4. Assume that $h_i(Z)$ is symmetric with respect to the $i$th coordinate. Then

$$E_\theta (\theta_i - \tilde{\theta}_i^+)^2 \le E_\theta (\theta_i - \tilde{\theta}_i)^2.$$

Furthermore, if

$$P_\theta( h_i(Z) > 1 ) > 0, \tag{5}$$

then

$$E_\theta (\theta_i - \tilde{\theta}_i^+)^2 < E_\theta (\theta_i - \tilde{\theta}_i)^2.$$

Proof: A simple calculation shows that

$$E_\theta (\theta_i - \tilde{\theta}_i^+)^2 - E_\theta (\theta_i - \tilde{\theta}_i)^2 = E_\theta\big( (\tilde{\theta}_i^+)^2 - \tilde{\theta}_i^2 \big) - 2\theta_i E_\theta( \tilde{\theta}_i^+ - \tilde{\theta}_i ). \tag{6}$$

Let us calculate the expectation by conditioning on $h_i(Z)$. For $h_i(Z) \le 1$, $\tilde{\theta}_i^+ = \tilde{\theta}_i$. Hence it is sufficient to condition on $h_i(Z) = b$ where $b > 1$ and show that

$$E_\theta\big( (\tilde{\theta}_i^+)^2 - \tilde{\theta}_i^2 \mid h_i(Z) = b \big) - 2\theta_i E_\theta\big( \tilde{\theta}_i^+ - \tilde{\theta}_i \mid h_i(Z) = b \big) \le 0,$$

or equivalently (since $\tilde{\theta}_i^+ = 0$ when $b > 1$),

$$- E_\theta\big( \tilde{\theta}_i^2 \mid h_i(Z) = b \big) + 2\theta_i E_\theta\big( \tilde{\theta}_i \mid h_i(Z) = b \big) \le 0.$$

Obviously, the last inequality is satisfied if we can show

$$\theta_i E_\theta\big( \tilde{\theta}_i \mid h_i(Z) = b \big) = \theta_i (1 - b)\, E_\theta\big( Z_i \mid h_i(Z) = b \big) \le 0,$$

or equivalently,

$$\theta_i E_\theta\big( Z_i \mid h_i(Z) = b \big) \ge 0.$$

We may further condition on $Z_j = z_j$ for $j \ne i$, and it then suffices to establish

$$\theta_i E_\theta\big( Z_i \mid h_i(Z) = b,\ Z_j = z_j,\ j \ne i \big) \ge 0. \tag{7}$$

Given $Z_j = z_j$, $j \ne i$, consider only the case where $h_i(Z) = b$ has solutions in $Z_i$. Due to the symmetry of $h_i(Z)$, these solutions come in pairs; let $\pm y_k$, $k \in K$, denote the solutions. Hence the left-hand side of (7) equals

$$\theta_i E_\theta\big( Z_i \mid |Z_i| = y_k,\ k \in K \big) = \sum_{k \in K} \theta_i E_\theta\big( Z_i \mid |Z_i| = y_k \big)\, P_\theta\big( |Z_i| = y_k \mid |Z_i| = y_k,\ k \in K \big).$$

Note that

$$\theta_i E_\theta\big( Z_i \mid |Z_i| = y_k \big) = \frac{\theta_i y_k\, e^{y_k \theta_i} - \theta_i y_k\, e^{-y_k \theta_i}}{e^{y_k \theta_i} + e^{-y_k \theta_i}}, \tag{8}$$

which is symmetric in $\theta_i y_k$ and increasing for $\theta_i y_k \ge 0$. Hence (8) is bounded below by zero, the bound obtained by substituting $\theta_i y_k = 0$ in (8). Consequently, we establish that (6) is nonpositive, implying the domination of $\tilde{\theta}^+$ over $\tilde{\theta}$.

The strict inequality of the theorem can be established by noting that the right-hand side of (6) is bounded above by $E_\theta[(\tilde{\theta}_i^+)^2 - \tilde{\theta}_i^2]$, which by (5) is strictly negative. $\square$

Theorem 4 implies the following corollary.

Corollary 5. Under the assumptions of Theorem 3, $Z$ is dominated by $\hat{\theta}$, which in turn is strictly dominated by its positive part $\hat{\theta}^+$ with $i$th component

$$\hat{\theta}_i^+ = \big( 1 - a D^{-1} |Z_i|^{\beta-2} \big)_+ Z_i. \tag{9}$$

It is interesting to note that estimator (9), for $\beta < 2$, gives zero as the estimate when $|Z_i|$ is small. When applied to wavelet analysis, it truncates the small wavelet coefficients and shrinks the large wavelet coefficients; the estimator lies in a data-chosen reduced model.

Moreover, for $\beta = 2$, Theorem 3 reduces to the classical result of Stein (1981), and (9) to the positive part James–Stein estimator. The above bound on $a$ for
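The thresholding behavior of (9) is easy to see numerically. The following Python sketch (the function name and test vector are illustrative; $a$ is taken to be the sufficient bound $2(\beta-1)d - 2\beta$ of Theorem 3, with $\beta = 4/3$ as used later for wavelets) sets the smallest coordinates exactly to zero and shrinks the rest.

```python
import numpy as np

# The positive-part estimator (9): coordinates with small |Z_i| have
# shrinkage factor <= 0 and are truncated exactly to zero; the larger
# coordinates are shrunk toward the origin.
def theta_hat_plus(Z, beta, a=None):
    d = Z.size
    if a is None:
        a = 2 * (beta - 1) * d - 2 * beta       # sufficient bound of Theorem 3
    D = np.sum(np.abs(Z) ** beta)
    factor = 1.0 - a * np.abs(Z) ** (beta - 2) / D
    return np.maximum(factor, 0.0) * Z

Z = np.array([0.05, -0.1, 3.0, -2.5, 0.2, 4.0, 1.8, -0.03])  # illustrative
est = theta_hat_plus(Z, beta=4/3)
print(est)   # the smallest coordinates are truncated to 0, the rest shrunk
```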

domination stated in Theorem 3 works only if $\beta > 1$ and $d > \beta/(\beta - 1)$. Although we cannot provide a domination result for $\beta \le 1$, this does not mean that such a result is impossible. We are particularly interested in $\beta \ge \frac{1}{2}$, since in our experience with wavelet analysis $\beta$ may sometimes be below 1 but is usually larger than $\frac{1}{2}$. The asymptotic result in Section 8 only assumes that $\beta > 0$.

3. What is the Largest Possible a?

In wavelet analysis, for a reasonably smooth function, a vast majority of the wavelet coefficients are zero. Based on such information, it seems reasonable to choose an estimator that shrinks as much as possible, as long as it does not overshrink. Overshrinking can be prevented as long as the resulting estimator dominates $Z$. Hence in this section we set out to find the largest possible $a$; the pursuit also yields a domination result for $\frac{1}{2} \le \beta \le 1$. Since ultimately we will recommend the positive part estimator, the reduction in risk will be maximized for small $\theta_i$'s, a situation that happens often in wavelet analysis.

To investigate the largest possible shrinkage, we evaluate the Bayes risk of $\hat{\theta}$ in (2) and (3), assuming that the $\theta_i$ are i.i.d. $N(0, \tau^2)$. Note that the difference between the Bayes risks of $\hat{\theta}$ and $Z$ equals $E\Delta$, where

$$\Delta = \sum_{i=1}^{d} (Z_i - g_i(Z) - \theta_i)^2 - \sum_{i=1}^{d} (Z_i - \theta_i)^2 = \sum_{i=1}^{d} \big( -2(Z_i - \theta_i) g_i(Z) + g_i^2(Z) \big),$$

and $g_i(Z) = a\, \mathrm{sign}(Z_i)\, D^{-1} |Z_i|^{\beta-1}$. (We write $\Delta$ to avoid confusion with the normalizer $D$ of (3).) To calculate the expectation with respect to $Z_i$ and $\theta_i$, we first calculate the

MINIMAX ESTIMATION WITH THRESHOLDINGditional expectation given Zi . Since E(θi Zi ) ED EE(D Z1 , . . . , Zp ) Ed h³X2i 1τ 2 Zi1 τ 2 ,11we obtaini Zi2g(Z) g(Z)ii1 τ2d h³X 2a Zi βa2 Zi 2β 2 i E(1 τ 2 )DD2i 1 Ed³ a2 XD2 Zi 2β 2 i 12a .(1 τ 2 )Note that D 0 if0 a 2(1 τ 2 ) ¡ PdE D12 i 1 Zi 2β 2(10)where the expectation is taken over Zi which are i.i.d. andZi N (0, 1 τ 2 ). Let ξi Zi / 1 τ 2 and consequently ξi N (0, 1). We see that condition (10) isequivalent to³ E Pp ξ 2β 2 i 1 iP0 a aB 2/.( ξi β )2(11)Hence we have the following theorem.Theorem 6. Assume the prior distribution that θi are i.i.d. N (0, τ 2 ). Then theBayes risk of θb is no greater than Z (Z1 , . . . , Zp ) for every τ 2 if and only if0 a aB where aB is defined in (11).Obviously the bound aB is a necessary bound for θb to dominate Z. Our numericalstudies not reported here, however, show that it is sufficient for the domination ofθb and hence θb over Z by Theorem 4.There is a good reason for the domination result of θb when a aB intuitively.Note that for every τ 2 , and in particular for τ 2 , Theorem 6 implies that θb

has Bayes risk no greater than that of $Z$. Since $\hat{\theta}^+$ dominates $\hat{\theta}$, this implies that $\hat{\theta}^+$ tends to have smaller risk than $Z$ for large $\theta$. Moreover, $\hat{\theta}^+$ shrinks $Z$ toward the origin, so it seems intuitively reasonable that it should have smaller risk than $Z$ for small $\theta$. Consequently, its risk should have a good chance of being no greater than that of $Z$ for all $\theta$. This is similar to the argument of tail minimaxity of Berger (1976).

The normality assumption on the $\theta_i$ may seem limited. However, the domination result of Theorem 6 holds for many other distributions. Indeed, it holds for any variance mixture of normals, i.e., taking $\tau^2$ to be random with an arbitrary distribution. A special case of a variance mixture of normals is the multivariate $t$ distribution; that is, $\theta_i$ has the same distribution as $\xi_i / S$, where, as before, the $\xi_i$ are i.i.d. standard normal and $S$, independent of the $\xi_i$'s, has the same distribution as $\sqrt{\chi^2_N / N}$, where $\chi^2_N$ is a chi-squared random variable with $N$ degrees of freedom.

What is the bound $a_B$? It is easy to calculate $a_B$ numerically by simulating the $\xi_i$ ten thousand times and evaluating (11). Figure 1 shows that, for $\beta = 4/3$, $a_B$ is at least as big as $\frac{5}{3}(d - 2)$ for virtually all $d$, since the ratio of $a_B$ to the latter, which is plotted in Figure 1, is always larger than one. This bound $\frac{5}{3}(d - 2)$ is more than twice as big as the sufficient bound for $\beta = \frac{4}{3}$ given in Theorem 3.

Putting all this together, we come to the conclusion that the estimator $\hat{\theta}^+$ with $i$th component

$$\hat{\theta}_i^+ = \bigg( 1 - \frac{\frac{5}{3}(d - 2)\, |Z_i|^{-2/3}}{\sum_{i=1}^{d} |Z_i|^{4/3}} \bigg)_+ Z_i \tag{12}$$

should have risk smaller than $d$. For $d = 50$, it is shown in Figure 2 that $\hat{\theta}^+$ dominates $Z$. This estimator, when applied to the wavelet examples in Section 7, usually
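The risk of (12) at a sparse mean vector, the situation targeted in wavelet analysis, can be estimated by simulation. The Python sketch below is ours (the dimension, sparsity pattern, signal size, and seed are illustrative choices, not the configurations behind Figures 1 and 2); the estimated risk comes out well below the risk $d$ of the naive estimator $Z$.

```python
import numpy as np

# Monte Carlo risk of the estimator (12): beta = 4/3, a = (5/3)(d - 2),
# at a mostly-zero mean vector.  The risk of the naive estimator Z is d.
rng = np.random.default_rng(1)
d, n = 50, 20_000
theta = np.zeros(d)
theta[:3] = 5.0                        # only 3 of 50 coordinates carry signal
a = 5/3 * (d - 2)

Z = rng.standard_normal((n, d)) + theta
D = np.sum(np.abs(Z) ** (4/3), axis=1, keepdims=True)
factor = np.maximum(1.0 - a * np.abs(Z) ** (-2/3) / D, 0.0)
risk = np.mean(np.sum((factor * Z - theta) ** 2, axis=1))
print(risk, d)   # estimated risk is well below d
```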

produces risks smaller than those of $\tilde{\theta}$ with $a = \frac{2}{3}(d - 4)$, the bound given in Theorem 3 and Corollary 5 for $\beta = 4/3$. Also, $\tilde{\theta}$ with the larger shrinkage factor $a = \frac{6}{3}(d - 2) = 2(d - 2)$ does not do as well for the examples of Section 5 either; this choice seems to have overshrunk $Z$. It is interesting that the criterion of dominating $Z$ does provide very useful guidance in choosing $a$. Also, using the largest possible $a$ for domination leads to the best choice, especially in the situation where most of the $\theta_i$'s are zero, as in the wavelet case.

It would be convenient to have an approximate formula for the upper bound $a_B$ for every $\beta$. It seems tempting to derive the asymptotic limit of $a_B / d$ as $d \to \infty$, which, for $\frac{1}{2} \le \beta \le 2$, equals

$$C_\beta = 2 \big( E|\xi_i|^{\beta} \big)^2 \big/ E|\xi_i|^{2\beta-2} = \frac{4\, \Gamma^2\big( \frac{\beta+1}{2} \big)}{\sqrt{\pi}\, \Gamma\big( \frac{2\beta-1}{2} \big)}. \tag{13}$$

It may seem tempting to use $a = C_\beta (d - 2)$. For the case $\beta = 4/3$, this is about $\frac{5.17}{3}(d - 2)$ rather than the $\frac{5}{3}(d - 2)$ suggested by (12). Note that 97% of $5.17/3$ is approximately $5/3$. Hence we end up with the suggested formula

$$a = 0.97\, C_\beta (d - 2). \tag{14}$$

Although this formula is suggested by the case $\beta = 4/3$, further numerical investigation not reported here shows that using (14) for $a$ in (9) leads to a $\hat{\theta}^+$ that dominates $Z$.

4. Approximate Bayesian Justification.

It would seem interesting to justify the proposed estimator from a Bayesian point of view. To do so, we consider a prior of the form

$$\pi(\theta) = \begin{cases} 1, & \|\theta\|_\beta \le 1, \\ 1 / \|\theta\|_\beta^{\beta c}, & \|\theta\|_\beta > 1, \end{cases}$$
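The closed form in (13) follows from the standard-normal absolute-moment formula $E|\xi|^p = 2^{p/2}\,\Gamma\big(\frac{p+1}{2}\big)/\sqrt{\pi}$. A quick check (Python, standard library only; the function name `C` is ours): at $\beta = 4/3$ the formula gives about $5.17/3$, and at $\beta = 2$ it gives exactly 2, recovering the James–Stein constant $2(d-2)$ via $C_\beta(d-2)$.

```python
from math import gamma, pi, sqrt

# The limit (13): C_beta = 2 (E|xi|^beta)^2 / E|xi|^(2 beta - 2)
#               = 4 Gamma((beta+1)/2)^2 / (sqrt(pi) Gamma((2 beta - 1)/2)),
# using E|xi|^p = 2^(p/2) Gamma((p+1)/2) / sqrt(pi) for xi ~ N(0, 1).
def C(beta):
    return 4 * gamma((beta + 1) / 2) ** 2 / (sqrt(pi) * gamma((2 * beta - 1) / 2))

print(C(4/3))   # about 5.17/3, as stated in the text
print(C(2.0))   # 2 (up to floating point), the James-Stein case
```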

where $\|\theta\|_\beta = \big( \sum |\theta_i|^{\beta} \big)^{1/\beta}$ and $c$ is a positive constant which can be specified to match the constant $a$ in (9). In general, the Bayes estimator is given by

$$Z + \nabla \log m(Z),$$

where $m(Z)$ is the marginal probability density function of $Z$; namely,

$$m(Z) = \int \cdots \int \frac{1}{(\sqrt{2\pi})^d}\, e^{-\frac{1}{2}\|Z - \theta\|^2}\, \pi(\theta)\, d\theta.$$

The following approximation follows from Brown (1971); it asserts that $\nabla \log m(Z)$ can be approximated by $\nabla \log \pi(Z)$. The proof is given in the Appendix.

Theorem 7. With $\pi(\theta)$ and $m(Z)$ given above,

$$\lim_{|Z_i| \to \infty} \frac{\frac{\partial}{\partial Z_i} \log m(Z)}{\frac{\partial}{\partial Z_i} \log \pi(Z)} = 1.$$

Hence, by Theorem 7, the $i$th component of the Bayes estimator equals approximately

$$Z_i + \frac{\partial}{\partial Z_i} \log \pi(Z) = Z_i - \frac{c\beta\, |Z_i|^{\beta-1}\, \mathrm{sign}(Z_i)}{\sum |Z_i|^{\beta}}.$$

This is similar to the untruncated version of $\hat{\theta}$ in (2) and (3).

5. Empirical Bayes Justification.

Based on several signals and images, Mallat (1989) proposed a prior for the wavelet coefficients $\theta_i$: the exponential power distribution with probability density function (p.d.f.) of the form

$$f(\theta_i) = k\, e^{-(|\theta_i|/\alpha)^{\beta}}, \tag{15}$$

where $\alpha$ and $\beta \le 2$ are positive constants and

$$k = \beta / \big( 2\alpha\, \Gamma(1/\beta) \big)$$

is the normalization constant. See also Vidakovic (1999, p. 194). Using the method of moments, Mallat estimated the values of $\alpha$ and $\beta$ to be 1.39 and 1.14 for a particular graph. In practice, however, $\alpha$ and $\beta$ are typically unknown.

It seems reasonable to derive an empirical Bayes estimator based on this class of prior distributions. First we assume that $\alpha$ is known. Then the Bayes estimator of $\theta_i$ is

$$Z_i + \frac{\partial}{\partial Z_i} \log m(Z).$$

Similar to the argument in Theorem 7, and noting that for $\beta < 2$,

$$e^{-|\theta_i - Z_i|^{\beta}/\alpha^{\beta}} \big/ e^{-|\theta_i|^{\beta}/\alpha^{\beta}} \to 1 \quad \text{as } \theta_i \to \infty,$$

the Bayes estimator can be approximated by

$$Z_i + \frac{\partial}{\partial Z_i} \log \pi(Z_i) = Z_i - \frac{\beta}{\alpha^{\beta}}\, |Z_i|^{\beta-1}\, \mathrm{sign}(Z_i).$$

