Sequential Complexities and Uniform Martingale Laws of Large Numbers


Noname manuscript No. (will be inserted by the editor)

Sequential Complexities and Uniform Martingale Laws of Large Numbers

Alexander Rakhlin · Karthik Sridharan · Ambuj Tewari

Received: date / Accepted: date

Abstract We establish necessary and sufficient conditions for a uniform martingale Law of Large Numbers. We extend the technique of symmetrization to the case of dependent random variables and provide "sequential" (non-i.i.d.) analogues of various classical measures of complexity, such as covering numbers and combinatorial dimensions from empirical process theory. We establish relationships between these various sequential complexity measures and show that they provide a tight control on the uniform convergence rates for empirical processes with dependent data. As a direct application of our results, we provide exponential inequalities for sums of martingale differences in Banach spaces.

Keywords empirical processes · dependent data · uniform Glivenko-Cantelli classes · Rademacher averages · sequential prediction

Mathematics Subject Classification (2000) 60E15 · 60C05 · 60F15 · 91A20

Alexander Rakhlin
University of Pennsylvania, 3730 Walnut St, Philadelphia, PA 19103
E-mail: rakhlin@wharton.upenn.edu

Karthik Sridharan
University of Pennsylvania, 3730 Walnut St, Philadelphia, PA 19103
E-mail: skarthik@wharton.upenn.edu

Ambuj Tewari
University of Michigan, 439 West Hall, 1085 South University, Ann Arbor, MI 48109
E-mail: tewaria@umich.edu

1 Introduction

Let (Ω, A, P) be an arbitrary complete probability space. Let Z be a separable metric space and F ⊆ {f : Z → R} be a set of bounded real-valued functions on Z. Consider independent and identically distributed random variables Z_1, ..., Z_n, ... in Z with common distribution P. The empirical process indexed by f ∈ F is defined as

    G_n(f) = (1/n) Σ_{t=1}^{n} ( E f(Z) − f(Z_t) ).

The study of the behavior of the supremum of this process is a central topic in empirical process theory, and it is well known that this behavior depends on the "richness" of F. Statements about convergence of the supremum to zero are known as uniform Laws of Large Numbers (LLN). More precisely, a class F is said to be (strong) Glivenko-Cantelli for the distribution P if the supremum of G_n(f) converges to zero almost surely as n → ∞. Of particular interest are classes for which this convergence happens uniformly over all distributions. A class F is said to be uniform Glivenko-Cantelli if, for every δ > 0,

    lim_{n→∞} sup_P P( sup_{n′≥n} sup_{f∈F} G_{n′}(f) > δ ) = 0,    (1)

where the probability is computed with respect to the product measure P^∞. As a classical example, consider i.i.d. real-valued random variables Z_1, ..., Z_n and the class F = {z ↦ 1{z ≤ θ} : θ ∈ R}, where 1{·} is the indicator function. For this class, (1) holds by the well-known results of Glivenko and Cantelli: almost surely, the supremum of the difference between the cumulative distribution function and the empirical distribution function converges to zero. A number of necessary and sufficient conditions for the Glivenko-Cantelli and the uniform Glivenko-Cantelli properties have been derived over the past several decades [11].

In this paper, we are interested in the martingale analogues of the uniform LLN, as well as in analogues of the various notions of complexity that appear in empirical process theory. Specifically, consider a sequence of random variables (Z_t)_{t≥1} adapted to a filtration (A_t)_{t≥1}.
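As a numerical illustration of this classical example (ours, not part of the manuscript): for Uniform(0,1) samples, the supremum over the threshold class equals the Kolmogorov-Smirnov distance between the empirical CDF and F(θ) = θ, which can be computed exactly from the order statistics. The helper name `ks_distance_uniform` and the choice of distribution are illustrative assumptions.

```python
import numpy as np

def ks_distance_uniform(sample):
    """sup_theta |F_n(theta) - theta| for the Uniform(0,1) CDF F(theta) = theta.

    Over the threshold class {z -> 1{z <= theta}}, the supremum is attained
    at the order statistics, so it can be computed exactly from the sorted
    sample: compare i/n and (i-1)/n against the i-th order statistic.
    """
    u = np.sort(sample)
    n = len(u)
    grid = np.arange(1, n + 1) / n
    return max(np.max(np.abs(grid - u)), np.max(np.abs(grid - 1.0 / n - u)))

rng = np.random.default_rng(0)
for n in [100, 1000, 10000]:
    print(n, ks_distance_uniform(rng.uniform(size=n)))  # shrinks roughly like 1/sqrt(n)
```

The printed supremum deviations decrease with n, in line with the almost-sure convergence asserted by the Glivenko-Cantelli theorem.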
We are interested in the following process indexed by f ∈ F:

    M_n(f) = (1/n) Σ_{t=1}^{n} ( E[ f(Z_t) | A_{t−1} ] − f(Z_t) ).

The central object of study in this paper is the supremum of the process M_n(f), and in particular we address the question of whether a uniform convergence similar to (1) holds. Evidently, M_n(f) coincides with G_n(f) in the case when Z_1, Z_2, ... are i.i.d. random variables. More generally, for any fixed f ∈ F, the sequence (E[f(Z_t) | A_{t−1}] − f(Z_t))_{t≥1} is a martingale difference sequence. Similarly to the notion of a uniform Glivenko-Cantelli class F, we can define the notion of uniform convergence for dependent random variables over the function class F as follows.

Definition 1 A function class F satisfies Sequential Uniform Convergence if, for every δ > 0,

    lim_{n→∞} sup_P P( sup_{n′≥n} sup_{f∈F} M_{n′}(f) > δ ) = 0,    (2)

where the supremum is over all distributions P on the space (Ω, A).

The gap between properties (1) and (2) is already witnessed by the example of the class F = {z ↦ 1{z ≤ θ} : θ ∈ R} of functions on R, discussed earlier. In contrast to the uniform Glivenko-Cantelli property, the martingale analogue (2) does not hold for this class. On the positive side, the necessary and sufficient conditions for a class F to satisfy sequential uniform convergence, as derived in this paper, can be verified for a wide range of interesting classes.

2 Summary of the Results

One of the main results of this paper is the following equivalence.

Theorem 1 Let F be a class of [−1, 1]-valued functions. Then the following statements are equivalent.

1. F satisfies Sequential Uniform Convergence.
2. For any α > 0, the sequential fat-shattering dimension fat_α(F) is finite.
3. The sequential Rademacher complexity R_n(F) satisfies lim_{n→∞} R_n(F) = 0.

Theorem 1 yields a characterization of the uniform convergence property in terms of two quantities. The first is a combinatorial "dimension" of the class at scale α (Definition 7). The second is a measure of complexity of the class through random averages (Definition 3). In addition to these quantities, we define sequential versions of covering numbers and the associated Dudley-type entropy integral. En route to proving Theorem 1, we obtain key relationships between the introduced covering numbers, combinatorial dimensions, and random averages. These relationships constitute the bulk of the paper, and can be considered as martingale extensions of the results in empirical process theory.
Specifically, we show:

– A relationship between the empirical process with dependent random variables and the sequential Rademacher complexity (Theorem 2), obtained through sequential symmetrization.
– An upper bound on the sequential Rademacher complexity by a Dudley-type entropy integral through the chaining technique (Theorem 3).
– An upper bound on sequential covering numbers in terms of the combinatorial dimensions (Theorems 4 and 5, as well as Corollary 1). In particular, Theorem 5 is a sequential analogue of the celebrated Vapnik-Chervonenkis-Sauer-Shelah lemma.

– A relationship between the combinatorial dimension and the sequential Rademacher complexity (Lemma 2) and, as a consequence, the equivalence of many of the introduced complexity notions up to a poly-logarithmic factor.
– Properties of the sequential Rademacher complexity and, in particular, a contraction inequality (Lemma 7).
– An extension of the above results to high-probability statements (Lemmas 4, 5, and 6) and an application to concentration of martingales in Banach spaces (Corollary 2).

This paper is organized as follows. In the next section we place the present paper in the context of previous work. In Sections 4-6 we introduce sequential complexities. A characterization of sequential uniform convergence appears in Section 7. We conclude the paper with some structural results in Section 8 and an application to exponential inequalities for sums of martingale difference sequences in Banach spaces in Section 9. Most proofs are deferred to the appendix.

3 Related Literature

The seminal work of Vapnik and Chervonenkis [37] provided the first necessary and sufficient conditions – via a notion of random VC entropy – for a class F of binary-valued functions to be a Glivenko-Cantelli (GC) class. These results were strengthened by Steele [29], who showed almost sure convergence. A similar characterization of the GC property via a notion of a covering number in the case of uniformly bounded real-valued functions appears in [38]. For the binary-valued case, a distribution-independent version of the VC entropy (termed the growth function) was shown by Vapnik and Chervonenkis [37] to yield a sufficient condition for the uniform GC property. The "necessary" direction was first shown (according to [11, p. 229]) in an unpublished manuscript of Assouad, 1982. For real-valued classes of functions, necessary and sufficient conditions for the uniform GC property were established in [12] through a notion of a covering number similar to the Koltchinskii-Pollard entropy.
A characterization of GC classes for a fixed distribution was also given by Talagrand [30, 31] through a notion of a "witness of irregularity". Similar in spirit, the pseudo-dimension introduced in [24] was shown by Pollard to be sufficient, though not necessary, for the uniform GC property. A scale-sensitive version of the pseudo-dimension (termed the fat-shattering dimension by [5]) was introduced by Kearns and Schapire [16]. Finiteness of this dimension at all scales was shown in [3] to characterize the uniform GC classes. We refer the reader to [11, Chapter 6] and [33, 34] for a much more detailed account of these results.

GC-type theorems have also been extended to the case of weakly dependent random variables. For instance, Yukich [40] relies on a φ-mixing assumption, while Nobel and Dembo [21] and Yu [39] consider β-mixing sequences.

For a countable class with a finite VC dimension, a GC theorem has recently been shown by Adams and Nobel [2] for ergodic sequences. We refer the reader to [2, 10] for a more comprehensive survey of results for non-i.i.d. data. Notably, the aforementioned papers prove a GC-type property under much the same type of complexity measures as in the i.i.d. case. This is in contrast to the present paper, where the classical notions do not provide answers to the questions of convergence.

In this paper, we do not make mixing or ergodicity assumptions on the sequence of random variables. However, the definition of M_n(f) imposes a certain structure which is not present when an average is compared with a single expected value. Thus, our results yield an extension of the GC property to non-i.i.d. data in a direction that is different from the papers mentioned above. Such an extension has already been considered in the literature: the quantity sup_{f∈F} M_n(f) has been studied by S. van de Geer [34] (see Chapter 8.2). Dudley-integral-type upper bounds for a given distribution P were provided in terms of the so-called generalized entropy with bracketing, corresponding to the particular distribution P. This is a sufficient condition for convergence of the supremum of M_n(f) for the given distribution. In this work, however, we are interested in providing necessary and sufficient conditions for the uniform analogue of the GC property, as well as in extending the ideas of symmetrization, covering numbers, and scale-sensitive dimensions to the non-i.i.d. case.
Towards the end of Section 7, we discuss the relationship between the generalized entropy with bracketing of [34] and the tools provided in this work.

We also stress that this paper studies martingale uniform laws of large numbers rather than convergence of √n · M_n(f), which only holds under stringent conditions; such a convergence for reverse martingales has been studied in [36]. The question of the limiting behavior of √n · M_n(f) (that is, the analogue of the Donsker property [11]) is also outside the scope of this paper.

The study of the supremum of the process M_n(f) has many potential applications. For instance, in [35], the quantity sup_{f∈F} M_n(f) is used to provide bounds on estimation rates for autoregressive models. In [1, 25], connections between minimax rates of sequential prediction problems and the supremum of the process M_n(f) over the associated class of predictors F are established. In Section 9 of this work, we show how the supremum of M_n(f) over a class of linear functionals can be used to derive exponential inequalities for sums of martingale differences in general Banach spaces.

4 Symmetrization and the Tree Process

A key tool in deriving classical uniform convergence theorems (for i.i.d. random variables) is symmetrization. The main idea behind symmetrization is to compare the empirical process G_n(f) over a probability space (Ω, A, P) to a

symmetrized empirical process, called the Rademacher process, over the probability space (Ω_ε, B, P_ε), where Ω_ε = {±1}^N, B is the Borel σ-algebra, and P_ε is the uniform probability measure. We use the notation E_ε to represent expectation under the measure P_ε, and (B_t)_{t≥0} to denote the dyadic filtration on Ω_ε given by B_t = σ(ε_1, ..., ε_t), where the ε_t's are independent symmetric {±1}-valued Rademacher random variables and B_0 = {{}, Ω_ε}.

Given z_1, ..., z_n ∈ Z, the Rademacher process S_n^{(z_1,...,z_n)}(f) is defined^1 as

    S_n^{(z_1,...,z_n)}(f) = (1/n) Σ_{t=1}^{n} ε_t f(z_t).    (3)

It is well known (e.g. [33]) that the behavior of the supremum of the symmetrized process S_n^{(z_1,...,z_n)}(f) is closely related to the behavior of the supremum of the empirical process:

    E sup_{f∈F} G_n(f) ≤ 2 sup_{z_1,...,z_n ∈ Z} E_ε sup_{f∈F} S_n^{(z_1,...,z_n)}(f),    (4)

and a similar high-probability statement can also be proved. Note that the Rademacher process is defined on the probability space (Ω_ε, B, P_ε), which is potentially easier to handle than the original probability space for the empirical process.

In the non-i.i.d. case, however, a similar symmetrization argument requires significantly more care and relies on the notion of decoupled tangent sequences [9, Def. 6.1.4]. Fix a sequence of random variables (Z_t)_{t≥1} adapted to the filtration (A_t)_{t≥1}. A sequence of random variables (Z′_t)_{t≥1} is said to be a decoupled sequence tangent to (Z_t)_{t≥1} if for each t, conditioned on Z_1, ..., Z_{t−1}, the random variables Z_t and Z′_t are independent and identically distributed. Thus, the random variables (Z′_t)_{t≥1} are conditionally independent given (Z_t)_{t≥1}. In Theorem 2 below, a sequential symmetrization argument is applied to the decoupled sequences, leading to a tree process – an analogue of the Rademacher process for the non-i.i.d. case. First, let us define the notion of a tree.

A Z-valued tree z of depth n is a rooted complete binary tree with nodes labeled by elements of Z. We identify the tree z with the sequence (z_1, ...
, z_n) of labeling functions z_i : {±1}^{i−1} → Z which provide the labels for each node. Here, z_1 ∈ Z is the label of the root of the tree, while z_i for i > 1 is the label of the node obtained by following the path of length i − 1 from the root, with +1 indicating 'right' and −1 indicating 'left'. A path of length n is given by the sequence ε = (ε_1, ..., ε_n) ∈ {±1}^n. For brevity, we shall often write z_t(ε), but it is understood that z_t only depends on the prefix (ε_1, ..., ε_{t−1}) of ε. Given a tree z and a function f : Z → R, we define the composition f ∘ z as a real-valued tree given by the labeling functions (f ∘ z_1, ..., f ∘ z_n).

^1 For integers a ≤ b, we denote a sequence of the form (y_a, ..., y_b) by y_{a:b}. For any n ∈ N, we use [n] to denote the set {1, ..., n}.
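The identification of a tree with its labeling functions can be made concrete in code. The sketch below is ours and not part of the manuscript; the helper `make_tree` and the particular labels are arbitrary illustrations. Each level t is a function that reads only the prefix (ε_1, ..., ε_{t−1}) of a path, exactly as in the definition above.

```python
# A depth-n real-valued tree as a list of labeling functions: the function at
# level t maps the sign prefix (eps_1, ..., eps_{t-1}) to the node's label.

def make_tree(n):
    """Build a tree whose label at level t is the sum of the first t-1 signs."""
    def level(t):
        def z_t(eps):
            # z_t depends only on the prefix (eps_1, ..., eps_{t-1}) of the path
            return float(sum(eps[: t - 1]))
        return z_t
    return [level(t) for t in range(1, n + 1)]

z = make_tree(3)
path = (+1, -1, +1)  # (eps_1, eps_2, eps_3); +1 = 'right', -1 = 'left'
labels = [z[t - 1](path) for t in range(1, 4)]
print(labels)  # [0.0, 1.0, 0.0] -- root label, then labels along the chosen path
```

Note that evaluating a level at two paths with the same prefix returns the same label, which is the defining property of the labeling functions.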

Observe that if ε_1, ..., ε_n are i.i.d. Rademacher random variables, then (ε_t f(z_t(ε_1, ..., ε_{t−1})))_{t=1}^{n} is a martingale difference sequence for any given function f : Z → R.

Definition 2 Let ε_1, ..., ε_n be independent Rademacher random variables. Given a Z-valued tree z of depth n, the stochastic process

    T_n^{(z)}(f) = (1/n) Σ_{t=1}^{n} ε_t f(z_t(ε_1, ..., ε_{t−1}))

will be called the tree process indexed by F.

We may view the tree process T_n^{(z)}(f) as a generalization of the Rademacher process S_n^{(z_1,...,z_n)}(f). Indeed, suppose (z_1, ..., z_n) is a sequence of constant labeling functions such that for any t ∈ [n], z_t(ε_1, ..., ε_{t−1}) = z_t for all (ε_1, ..., ε_{t−1}). In this case, T_n^{(z)}(f) and S_n^{(z_1,...,z_n)}(f) coincide. In general, however, the tree process can behave differently (in a certain sense) from the Rademacher process. Given z_1, ..., z_n, the expected supremum of the Rademacher process in (4) is known as the (empirical) Rademacher average or Rademacher complexity of the function class. We propose the following definition for the tree process:

Definition 3 The sequential Rademacher complexity of a function class F ⊆ R^Z on a Z-valued tree z is defined as

    R_n(F, z) = E_ε sup_{f∈F} T_n^{(z)}(f) = E_ε [ sup_{f∈F} (1/n) Σ_{t=1}^{n} ε_t f(z_t(ε)) ]

and

    R_n(F) = sup_z R_n(F, z),

where the outer supremum is taken over all Z-valued trees of depth n, and ε = (ε_1, ..., ε_n) is a sequence of i.i.d. Rademacher random variables.

Theorem 2 The following relation holds between the empirical process with dependent random variables and the sequential Rademacher complexity:

    E sup_{f∈F} M_n(f) ≤ 2 R_n(F).    (5)

Furthermore, this bound is tight, as we have

    (1/2) ( R_n(F) − B / (2√n) ) ≤ sup_P E sup_{f∈F} M_n(f),    (6)

where B = inf_{z∈Z} sup_{f,f′∈F} (f(z) − f′(z)) ≥ 0.
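For small depth, the expectation in Definition 3 can be evaluated exactly by enumerating all 2^n sign sequences. The following toy computation is ours, not the paper's: the two-element class F = {x, −x} and the depth-2 tree are hand-picked illustrations, and `seq_rademacher` is an illustrative helper name.

```python
import itertools

def seq_rademacher(tree, funcs, n):
    """Exact E_eps max_f (1/n) sum_t eps_t f(z_t(eps)) over all 2^n sign paths."""
    total = 0.0
    for eps in itertools.product((-1, +1), repeat=n):
        # tree[t] reads only the prefix eps[:t], so the labels respect the
        # tree structure; take the best function for this realization of eps.
        total += max(
            sum(eps[t] * f(tree[t](eps)) for t in range(n)) / n for f in funcs
        )
    return total / 2 ** n

# A depth-2 tree: root labeled 1.0; the level-2 label depends on the first sign.
tree = [lambda eps: 1.0, lambda eps: 2.0 if eps[0] == +1 else 3.0]
funcs = [lambda x: x, lambda x: -x]  # the two-element class F = {x, -x}
print(seq_rademacher(tree, funcs, 2))  # 1.25
```

Replacing the second labeling function by a constant recovers the ordinary Rademacher average of the corresponding sequence, matching the remark that the two processes coincide for constant trees.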

We would like to point out that in general R_n(F) ≥ Ω(B/√n), and so in the worst case the behavior of the expected supremum of M_n is precisely given by the sequential Rademacher complexity. Further, we remark that for a class F of linear functions on some subset Z of a vector space such that 0 ∈ Z, we have B = 0 and the lower bound becomes (1/2) R_n(F).

The proof of Theorem 2 requires more work than the classical symmetrization proof [14, 11, 19] due to the non-i.i.d. nature of the sequences. To readers familiar with the notion of martingale type in the theory of Banach spaces, we would like to point out that the tree process can be viewed as an analogue of Walsh-Paley martingales. The upper bound of Theorem 2 is a generalization of the fact that the expected norm of a sum of martingale difference sequences can be upper bounded by the expected norm of a sum of Walsh-Paley martingale difference sequences, as shown in [23].

As mentioned earlier, the sequential Rademacher complexity is an object that is easier to study than the original empirical process M_n. The following sections introduce additional notions of complexity of a function class that provide control of the sequential Rademacher complexity. Specific relations between these complexity notions will be shown, leading to the proof of Theorem 1.

5 Finite Classes, Covering Numbers, and Chaining

The first step in upper bounding the sequential Rademacher complexity is the following result for a finite collection of trees.

Lemma 1 For any finite set V of R-valued trees of depth n we have

    E [ max_{v∈V} Σ_{t=1}^{n} ε_t v_t(ε) ] ≤ √( 2 log(|V|) · max_{v∈V} max_{ε∈{±1}^n} Σ_{t=1}^{n} v_t(ε)^2 ),

where |V| denotes the cardinality of the set V.

A simple consequence of the above lemma is that if F ⊆ [−1, 1]^Z is a finite class, then for any tree z we have

    E [ max_{f∈F} (1/n) Σ_{t=1}^{n} ε_t f(z_t(ε)) ] = E [ max_{v∈F(z)} (1/n) Σ_{t=1}^{n} ε_t v_t(ε) ] ≤ √( 2 log(|F(z)|) / n ) ≤ √( 2 log(|F|) / n ),    (7)

where F(z) = {f ∘ z : f ∈ F} is the projection of F onto z.
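The bound in Lemma 1 can be checked numerically for small instances. The sketch below is an illustration of ours: `random_tree` and the chosen sizes are arbitrary, the trees are [−1, 1]-valued so that Σ_t v_t(ε)^2 ≤ n, and the expectation over signs is computed exactly by enumerating all paths.

```python
import itertools
import math
import random

random.seed(0)
n, m = 8, 5  # depth of the trees and size of the finite set V

def random_tree(depth):
    """A [-1,1]-valued tree: level t maps each sign prefix of length t-1 to a label."""
    return [
        {p: random.uniform(-1, 1) for p in itertools.product((-1, +1), repeat=t)}
        for t in range(depth)
    ]

V = [random_tree(n) for _ in range(m)]
paths = list(itertools.product((-1, +1), repeat=n))

# Exact expectation over all sign sequences of the normalized maximum.
lhs = sum(
    max(sum(eps[t] * v[t][eps[:t]] for t in range(n)) / n for v in V)
    for eps in paths
) / len(paths)
rhs = math.sqrt(2 * math.log(m) / n)  # the bound, using sum_t v_t(eps)^2 <= n
print(lhs <= rhs, round(lhs, 3), round(rhs, 3))  # the comparison prints True
```

Since each level's labels are keyed by the sign prefix, the sampled objects respect the tree structure that the lemma requires.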
It is clear that |F(z)| ≤ |F|, which explains the second inequality above.

To illustrate the next idea, consider a binary-valued function class F ⊆ {±1}^Z. In the i.i.d. case, the cardinality of the coordinate projection

    {(f(z_1), ..., f(z_n)) : f ∈ F}

Sequential Complexities and Uniform Martingale Laws of Large Numbers9immediately yields a control of the supremum of the empirical process. For thetree-based definition, however, it is easy to see that the cardinality of F(z) isexponential in n for any interesting class F, leading to a vacuous upper bound.A key observation is that the first inequality in (7) holds with F(z) replacedby a potentially smaller set V of R-valued trees with the property that f F, { 1}n , v V s.t.vt ( ) f (zt ( ))(8)for all t [n]. Crucially, the choice of v may depend on . A set V of Rvalued trees satisfying (8) is termed a 0-cover of F RZ on a tree z of depthn. We denote by N (0, F, z) the size of a smallest 0-cover on z and defineN (0, F, n) supz N (0, F, z).To illustrate the gap between the size of a 0-cover and the cardinality of F(z

