Matrix Concentration Inequalities - Princeton University


ELE 520: Mathematics of Data Science
Matrix concentration inequalities
Yuxin Chen, Princeton University, Fall 2020

Recap: matrix Bernstein inequality

Consider a sequence of independent random matrices $X_l \in \mathbb{R}^{d_1 \times d_2}$ with
• $\mathbb{E}[X_l] = 0$ and $\|X_l\| \le B$ for each $l$
• variance statistic:
\[ v := \max\left\{ \left\| \mathbb{E}\Big[\sum\nolimits_l X_l X_l^\top\Big] \right\|,\; \left\| \mathbb{E}\Big[\sum\nolimits_l X_l^\top X_l\Big] \right\| \right\} \]

Theorem 3.1 (Matrix Bernstein inequality)
For all $\tau \ge 0$,
\[ \mathbb{P}\Big\{ \Big\| \sum\nolimits_l X_l \Big\| \ge \tau \Big\} \le (d_1 + d_2) \exp\left( \frac{-\tau^2/2}{v + B\tau/3} \right) \]

Recap: matrix Bernstein inequality

[Figure: the matrix Bernstein bound exhibits a Gaussian tail for small deviations and an exponential tail for large deviations]

This lecture: a detailed introduction to matrix Bernstein

"An introduction to matrix concentration inequalities"
Joel Tropp '15

Outline

• Matrix theory background
• Matrix Laplace transform method
• Matrix Bernstein inequality
• Application: random features

Matrix theory background

Matrix function

Suppose the eigendecomposition of a symmetric matrix $A \in \mathbb{R}^{d \times d}$ is
\[ A = U \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_d \end{bmatrix} U^\top. \]
Then we can define
\[ f(A) := U \begin{bmatrix} f(\lambda_1) & & \\ & \ddots & \\ & & f(\lambda_d) \end{bmatrix} U^\top. \]
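A minimal numerical sketch of this definition, assuming NumPy/SciPy are available; the helper name `matrix_function` is ours:

```python
import numpy as np
from scipy.linalg import expm

def matrix_function(A, f):
    """Apply a scalar function f to a symmetric matrix A
    via its eigendecomposition A = U diag(lambda) U^T."""
    eigvals, U = np.linalg.eigh(A)          # symmetric eigendecomposition
    return U @ np.diag(f(eigvals)) @ U.T    # f acts on the eigenvalues only

# Sanity check: f = exp should match the matrix exponential.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                           # random symmetric matrix
assert np.allclose(matrix_function(A, np.exp), expm(A))
```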

Examples of matrix functions

• Let $f(a) = c_0 + \sum_{k \ge 1} c_k a^k$; then
\[ f(A) := c_0 I + \sum_{k \ge 1} c_k A^k \]
• matrix exponential: $e^A := I + \sum_{k \ge 1} \frac{1}{k!} A^k$
  ◦ monotonicity: if $A \preceq H$, then $\operatorname{tr} e^A \le \operatorname{tr} e^H$ (why?)
• matrix logarithm: $\log(e^A) := A$
  ◦ monotonicity: if $0 \preceq A \preceq H$, then $\log A \preceq \log H$
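Both monotonicity claims are easy to spot-check numerically. A small sketch assuming NumPy/SciPy; the way we build an ordered pair $A \preceq H$ (adding a PSD perturbation) is our choice, not from the slides:

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(1)
d = 5
M = rng.standard_normal((d, d))
A = (M + M.T) / 2                    # random symmetric A
P = rng.standard_normal((d, d))
H = A + P @ P.T                      # H = A + (PSD term), so A <= H in Loewner order

# trace-exponential monotonicity: tr e^A <= tr e^H
assert np.trace(expm(A)) <= np.trace(expm(H))

# log monotonicity needs 0 <= A <= H, so shift A to be positive definite first.
A_pd = A + (abs(np.linalg.eigvalsh(A).min()) + 0.1) * np.eye(d)
H_pd = A_pd + P @ P.T
diff = logm(H_pd).real - logm(A_pd).real
diff = (diff + diff.T) / 2           # symmetrize away numerical asymmetry
assert np.linalg.eigvalsh(diff).min() >= -1e-8   # log(H) - log(A) is PSD
```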

Matrix moments and cumulants

Let $X$ be a random symmetric matrix. Then
• matrix moment generating function (MGF):
\[ M_X(\theta) := \mathbb{E}[e^{\theta X}] \]
• matrix cumulant generating function (CGF):
\[ \Xi_X(\theta) := \log \mathbb{E}[e^{\theta X}] \]
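Both objects are matrices and can be estimated by Monte Carlo. A sketch assuming NumPy/SciPy, using a (hypothetical, our choice) two-point distribution $X = \varepsilon A$ with Rademacher $\varepsilon$, whose MGF is also available in closed form for comparison:

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(2)
d, theta, n_samples = 4, 0.5, 5000
M = rng.standard_normal((d, d))
A = (M + M.T) / 2                  # fixed symmetric matrix
# X = eps * A with eps uniform on {-1, +1}, so E[X] = 0

eps = rng.choice([-1.0, 1.0], size=n_samples)
mgf_mc = np.mean([expm(theta * e * A) for e in eps], axis=0)

# Exact MGF for this two-point distribution, for comparison:
mgf_exact = (expm(theta * A) + expm(-theta * A)) / 2
cgf_exact = logm(mgf_exact).real   # matrix CGF

print(np.linalg.norm(mgf_mc - mgf_exact))   # shrinks like 1/sqrt(n_samples)
```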

Matrix Laplace transform method

Matrix Laplace transform

A key step for a scalar random variable $Y$: by Markov's inequality,
\[ \mathbb{P}\{Y \ge t\} \le \inf_{\theta > 0} e^{-\theta t}\, \mathbb{E}\big[e^{\theta Y}\big] \]
This can be generalized to the matrix case.

Matrix Laplace transform

Lemma 3.2
Let $Y$ be a random symmetric matrix. For all $t \in \mathbb{R}$,
\[ \mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le \inf_{\theta > 0} e^{-\theta t}\, \mathbb{E}\big[\operatorname{tr} e^{\theta Y}\big] \]
• can control the extreme eigenvalues of $Y$ via the trace of the matrix MGF

Proof of Lemma 3.2

For any $\theta > 0$,
\[
\mathbb{P}\{\lambda_{\max}(Y) \ge t\} = \mathbb{P}\big\{ e^{\theta \lambda_{\max}(Y)} \ge e^{\theta t} \big\}
\le \frac{\mathbb{E}\big[e^{\theta \lambda_{\max}(Y)}\big]}{e^{\theta t}} \quad \text{(Markov's inequality)}
\]
\[
= \frac{\mathbb{E}\big[e^{\lambda_{\max}(\theta Y)}\big]}{e^{\theta t}}
= \frac{\mathbb{E}\big[\lambda_{\max}\big(e^{\theta Y}\big)\big]}{e^{\theta t}} \quad \big(e^{\lambda_{\max}(Z)} = \lambda_{\max}(e^Z)\big)
\le \frac{\mathbb{E}\big[\operatorname{tr} e^{\theta Y}\big]}{e^{\theta t}}
\]
This completes the proof since it holds for any $\theta > 0$.

Issues of the matrix MGF

The Laplace transform method is effective for controlling an independent sum when the MGF decomposes
• in the scalar case where $X = X_1 + \cdots + X_n$ with independent $\{X_l\}$:
\[ M_X(\theta) = \mathbb{E}\big[e^{\theta X_1 + \cdots + \theta X_n}\big] = \mathbb{E}\big[e^{\theta X_1}\big] \cdots \mathbb{E}\big[e^{\theta X_n}\big] = \prod_{l=1}^{n} M_{X_l}(\theta) \]
(look at each $X_l$ separately)

Issues in the matrix setting (illustrated numerically below):
• $e^{X_1 + X_2} \neq e^{X_1} e^{X_2}$ unless $X_1$ and $X_2$ commute
• $\operatorname{tr} e^{X_1 + \cdots + X_n} \le \operatorname{tr}\big(e^{X_1} e^{X_2} \cdots e^{X_n}\big)$ fails in general once $n \ge 3$ (the Golden-Thompson inequality holds only for two matrices)
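A quick numerical illustration of both failure modes, assuming NumPy/SciPy; the random test matrices are ours:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)

def rand_sym(d):
    M = rng.standard_normal((d, d))
    return (M + M.T) / 2

X1, X2, X3 = (rand_sym(4) for _ in range(3))

# e^{X1+X2} != e^{X1} e^{X2} for generic (non-commuting) X1, X2:
print(np.linalg.norm(expm(X1 + X2) - expm(X1) @ expm(X2)))  # clearly nonzero

# Golden-Thompson holds for two matrices ...
assert np.trace(expm(X1 + X2)) <= np.trace(expm(X1) @ expm(X2))

# ... but no analogous ordering is guaranteed for three or more matrices;
# here we simply print both sides, with no inequality asserted.
print(np.trace(expm(X1 + X2 + X3)),
      np.trace(expm(X1) @ expm(X2) @ expm(X3)))
```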

Subadditivity of the matrix CGF

Fortunately, the matrix CGF satisfies a certain subadditivity rule, allowing us to decompose independent matrix components.

Lemma 3.3
Consider a finite sequence $\{X_l\}_{1 \le l \le n}$ of independent random symmetric matrices. Then for any $\theta \in \mathbb{R}$,
\[ \mathbb{E}\, \operatorname{tr} \exp\Big( \theta \sum\nolimits_l X_l \Big) \;\le\; \operatorname{tr} \exp\Big( \underbrace{\sum\nolimits_l \log \mathbb{E}\big[e^{\theta X_l}\big]}_{\sum_l \Xi_{X_l}(\theta)} \Big) \]
• this is a deep result, based on Lieb's Theorem!

Lieb's Theorem

Theorem 3.4 (Lieb '73)
Fix a symmetric matrix $H$. Then
\[ A \mapsto \operatorname{tr} \exp(H + \log A) \]
is concave on the positive-semidefinite cone.

[Photo: Elliott Lieb]

Lieb's Theorem immediately implies (exercise: Jensen's inequality)
\[ \mathbb{E}\, \operatorname{tr} \exp(H + X) \le \operatorname{tr} \exp\big(H + \log \mathbb{E}\, e^X\big) \tag{3.1} \]
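Concavity can be spot-checked numerically at a midpoint. A sketch assuming NumPy/SciPy; the random positive-definite inputs are our choice:

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(4)
d = 4

def rand_sym(d):
    M = rng.standard_normal((d, d))
    return (M + M.T) / 2

def rand_pd(d):
    M = rng.standard_normal((d, d))
    return M @ M.T + 0.1 * np.eye(d)   # strictly positive definite

H = rand_sym(d)
f = lambda A: np.trace(expm(H + logm(A).real)).real   # A -> tr exp(H + log A)

A1, A2 = rand_pd(d), rand_pd(d)
# Midpoint concavity: f((A1+A2)/2) >= (f(A1) + f(A2)) / 2
assert f((A1 + A2) / 2) >= (f(A1) + f(A2)) / 2 - 1e-8
```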

Proof of Lemma 3.3

\[
\mathbb{E}\, \operatorname{tr} \exp\Big( \theta \sum_{l=1}^{n} X_l \Big)
\le \mathbb{E}\, \operatorname{tr} \exp\Big( \theta \sum_{l=1}^{n-1} X_l + \log \mathbb{E}\big[e^{\theta X_n}\big] \Big) \quad \text{(by (3.1))}
\]
\[
\le \mathbb{E}\, \operatorname{tr} \exp\Big( \theta \sum_{l=1}^{n-2} X_l + \log \mathbb{E}\big[e^{\theta X_{n-1}}\big] + \log \mathbb{E}\big[e^{\theta X_n}\big] \Big)
\le \cdots
\le \operatorname{tr} \exp\Big( \sum_{l=1}^{n} \log \mathbb{E}\big[e^{\theta X_l}\big] \Big)
\]
Here each step conditions on the remaining $X_l$'s and applies (3.1) to the last summand.

Master bounds

Combining the Laplace transform method with the subadditivity of the CGF yields:

Theorem 3.5 (Master bounds for a sum of independent matrices)
Consider a finite sequence $\{X_l\}$ of independent random symmetric matrices. Then
\[ \mathbb{P}\Big\{ \lambda_{\max}\Big(\sum\nolimits_l X_l\Big) \ge t \Big\} \le \inf_{\theta > 0} \frac{\operatorname{tr} \exp\big( \sum_l \log \mathbb{E}[e^{\theta X_l}] \big)}{e^{\theta t}} \]
• this is a general result underlying the proofs of the matrix Bernstein inequality and beyond (e.g. matrix Chernoff)

Matrix Bernstein inequality

Matrix CGF

\[ \mathbb{P}\Big\{ \lambda_{\max}\Big(\sum\nolimits_l X_l\Big) \ge t \Big\} \le \inf_{\theta > 0} \frac{\operatorname{tr} \exp\big( \overbrace{\sum_l \log \mathbb{E}[e^{\theta X_l}]}^{\text{matrix CGF}} \big)}{e^{\theta t}} \]

To invoke the master bound, one needs to control the matrix CGF; this is the main step in proving matrix Bernstein.

Symmetric case

Consider a sequence of independent random symmetric matrices $X_l \in \mathbb{R}^{d \times d}$ with
• $\mathbb{E}[X_l] = 0$ and $\lambda_{\max}(X_l) \le B$ for each $l$
• variance statistic: $v := \big\| \mathbb{E}\big[\sum_l X_l^2\big] \big\|$

Theorem 3.6 (Matrix Bernstein inequality: symmetric case)
For all $\tau \ge 0$,
\[ \mathbb{P}\Big\{ \lambda_{\max}\Big(\sum\nolimits_l X_l\Big) \ge \tau \Big\} \le d \exp\left( \frac{-\tau^2/2}{v + B\tau/3} \right) \]
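A Monte Carlo sanity check of Theorem 3.6, assuming NumPy. We use the (hypothetical, our choice) example $X_l = z_l z_l^\top - I/d$ with $z_l$ uniform on the unit sphere, for which $\mathbb{E}[X_l] = 0$, $\lambda_{\max}(X_l) = 1 - 1/d =: B$, and $\mathbb{E}[X_l^2] = \frac{1}{d}(1 - \frac{1}{d}) I$ in closed form:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, trials, tau = 10, 200, 2000, 10.0

B = 1 - 1/d                       # deterministic bound on lambda_max(X_l)
v = n * (d - 1) / d**2            # || sum_l E[X_l^2] ||, computed exactly

def lam_max_sum():
    Z = rng.standard_normal((n, d))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # rows uniform on the sphere
    S = Z.T @ Z - n * np.eye(d) / d                 # sum_l (z_l z_l^T - I/d)
    return np.linalg.eigvalsh(S)[-1]                # largest eigenvalue

empirical = np.mean([lam_max_sum() >= tau for _ in range(trials)])
bernstein = d * np.exp(-(tau**2 / 2) / (v + B * tau / 3))
# A sanity check (empirical tail should not exceed the bound), not a tightness claim.
print(f"empirical tail: {empirical:.4f}, Bernstein bound: {bernstein:.4f}")
```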

Bounding the matrix CGF

For bounded random matrices, one can control the matrix CGF as follows:

Lemma 3.7
Suppose $\mathbb{E}[X] = 0$ and $\lambda_{\max}(X) \le B$. Then for $0 < \theta < 3/B$,
\[ \log \mathbb{E}\, e^{\theta X} \preceq \frac{\theta^2/2}{1 - \theta B/3}\, \mathbb{E}[X^2] \]
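For a distribution whose MGF is available in closed form, the semidefinite ordering in Lemma 3.7 can be checked directly. A sketch assuming NumPy/SciPy, using the (hypothetical, our choice) two-point example $X \in \{A, -A\}$ with equal probability, so that $\mathbb{E}[X] = 0$, $B = \|A\|$, and $\mathbb{E}[X^2] = A^2$:

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(6)
d = 4
M = rng.standard_normal((d, d))
A = (M + M.T) / 2
B = np.linalg.norm(A, 2)               # spectral norm: lambda_max(+-A) <= B

theta = 1.0 / B                        # any theta in (0, 3/B)
mgf = (expm(theta * A) + expm(-theta * A)) / 2    # exact matrix MGF
cgf = logm(mgf).real                   # exact matrix CGF
bound = (theta**2 / 2) / (1 - theta * B / 3) * (A @ A)   # E[X^2] = A^2

# Lemma 3.7 asserts that bound - cgf is positive semidefinite:
print(np.linalg.eigvalsh(bound - cgf).min())   # should be >= ~0
```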

Proof of Theorem 3.6

Let $g(\theta) := \frac{\theta^2/2}{1 - \theta B/3}$; then it follows from the master bound that
\[
\mathbb{P}\Big\{ \lambda_{\max}\Big(\sum\nolimits_i X_i\Big) \ge t \Big\}
\le \inf_{\theta > 0} \frac{\operatorname{tr} \exp\big( \sum_{i=1}^n \log \mathbb{E}[e^{\theta X_i}] \big)}{e^{\theta t}}
\overset{\text{Lemma 3.7}}{\le} \inf_{0 < \theta < 3/B} \frac{\operatorname{tr} \exp\big( g(\theta) \sum_{i=1}^n \mathbb{E}[X_i^2] \big)}{e^{\theta t}}
\le \inf_{0 < \theta < 3/B} \frac{d \exp\big( g(\theta)\, v \big)}{e^{\theta t}}
\]
Taking $\theta = \frac{t}{v + Bt/3}$ and simplifying the above expression, we establish matrix Bernstein.

Proof of Lemma 3.7

Define $f(x) := \frac{e^{\theta x} - 1 - \theta x}{x^2}$; then for any $X$ with $\lambda_{\max}(X) \le B$:
\[ e^{\theta X} = I + \theta X + \big(e^{\theta X} - I - \theta X\big) = I + \theta X + X \cdot f(X) \cdot X \preceq I + \theta X + f(B) \cdot X^2 \]
using that $f$ is nondecreasing, so $f(X) \preceq f(B)\, I$.

In addition, we note an elementary inequality: for any $0 < \theta < 3/B$,
\[ f(B) = \frac{e^{\theta B} - 1 - \theta B}{B^2} = \frac{1}{B^2} \sum_{k=2}^{\infty} \frac{(\theta B)^k}{k!} \le \frac{\theta^2}{2} \sum_{k=2}^{\infty} \frac{(\theta B)^{k-2}}{3^{k-2}} = \frac{\theta^2/2}{1 - \theta B/3} \]
(using $k! \ge 2 \cdot 3^{k-2}$), and hence
\[ e^{\theta X} \preceq I + \theta X + \frac{\theta^2/2}{1 - \theta B/3} \cdot X^2 \]
Since $X$ is zero-mean, one further has
\[ \mathbb{E}\, e^{\theta X} \preceq I + \frac{\theta^2/2}{1 - \theta B/3}\, \mathbb{E}[X^2] \preceq \exp\left( \frac{\theta^2/2}{1 - \theta B/3}\, \mathbb{E}[X^2] \right) \]
where the last step uses $I + A \preceq e^A$.

Application: random features

Kernel trick

A modern idea in machine learning: replace the inner product by a kernel evaluation (i.e. a certain similarity measure)

Advantage: work beyond the Euclidean domain via task-specific similarity measures

Similarity measure

Define the similarity measure $\Phi$ such that
• $\Phi(x, x) = 1$
• $|\Phi(x, y)| \le 1$
• $\Phi(x, y) = \Phi(y, x)$

Example: angular similarity
\[ \Phi(x, y) = \frac{2}{\pi} \arcsin \frac{\langle x, y \rangle}{\|x\|_2 \|y\|_2} = 1 - \frac{2 \angle(x, y)}{\pi} \]
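The two expressions agree because $\arcsin(t) + \arccos(t) = \pi/2$. A small sketch assuming NumPy; the helper name `angular_similarity` is ours:

```python
import numpy as np

def angular_similarity(x, y):
    """Phi(x, y) = (2/pi) * arcsin(<x, y> / (||x|| ||y||))."""
    c = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return (2 / np.pi) * np.arcsin(np.clip(c, -1.0, 1.0))

rng = np.random.default_rng(7)
x, y = rng.standard_normal(5), rng.standard_normal(5)
angle = np.arccos(np.clip(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)), -1, 1))
assert np.isclose(angular_similarity(x, y), 1 - 2 * angle / np.pi)
```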

Kernel matrix

Consider $N$ data points $x_1, \cdots, x_N \in \mathbb{R}^d$. Then the kernel matrix $G \in \mathbb{R}^{N \times N}$ is
\[ G_{i,j} = \Phi(x_i, x_j), \qquad 1 \le i, j \le N \]
• the kernel $\Phi$ is said to be positive semidefinite if $G \succeq 0$ for any $\{x_i\}$

Challenge: kernel matrices are usually large
• cost of constructing $G$ is $O(dN^2)$

Question: can we approximate $G$ more efficiently?

Random features

Introduce a random variable $w$ and a feature map $\psi$ such that
\[ \Phi(x, y) = \mathbb{E}_w[\psi(x; w) \cdot \psi(y; w)] \]
(this decouples $x$ and $y$)
• example (angular similarity):
\[ \Phi(x, y) = 1 - \frac{2 \angle(x, y)}{\pi} = \mathbb{E}_w\big[\operatorname{sgn}\langle x, w \rangle \cdot \operatorname{sgn}\langle y, w \rangle\big] \tag{3.2} \]
(Grothendieck's identity) with $w$ uniformly drawn from the unit sphere

Random features

Introduce a random variable $w$ and a feature map $\psi$ such that
\[ \Phi(x, y) = \mathbb{E}_w[\psi(x; w) \cdot \psi(y; w)] \]
• this results in a random feature vector
\[ z = \begin{bmatrix} z_1 \\ \vdots \\ z_N \end{bmatrix} = \begin{bmatrix} \psi(x_1; w) \\ \vdots \\ \psi(x_N; w) \end{bmatrix} \]
• the rank-1 matrix $zz^\top$ is an unbiased estimate of $G$, i.e. $G = \mathbb{E}[zz^\top]$

Example

Angular similarity:
\[ \Phi(x, y) = 1 - \frac{2 \angle(x, y)}{\pi} = \mathbb{E}_w\big[\operatorname{sign}\langle x, w \rangle \cdot \operatorname{sign}\langle y, w \rangle\big] \]
where $w$ is uniformly drawn from the unit sphere.

As a result, the random feature map is $\psi(x; w) = \operatorname{sign}\langle x, w \rangle$.

Random feature approximation

Generate $n$ independent copies of $R = zz^\top$, i.e. $\{R_l\}_{1 \le l \le n}$.

Estimator of the kernel matrix $G$:
\[ \hat{G} = \frac{1}{n} \sum_{l=1}^{n} R_l \]

Question: how many random features are needed to guarantee accurate estimation?
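A minimal end-to-end sketch of this estimator for the angular-similarity kernel, assuming NumPy (all variable names are ours). Since a standard Gaussian $w$ has the same sign pattern as a uniform draw from the sphere, we sample $w$ as a Gaussian:

```python
import numpy as np

rng = np.random.default_rng(8)
N, d, n = 100, 20, 2000

X = rng.standard_normal((N, d))                   # data points x_1..x_N as rows

# Exact angular-similarity kernel: G_ij = (2/pi) arcsin(<xi,xj>/(|xi||xj|))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
G = (2 / np.pi) * np.arcsin(np.clip(Xn @ Xn.T, -1.0, 1.0))

# Random feature estimate: z_l = sign(X w_l), G_hat = (1/n) sum_l z_l z_l^T
W = rng.standard_normal((d, n))                   # n random directions
Z = np.sign(X @ W)                                # N x n matrix of sign features
G_hat = (Z @ Z.T) / n

err = np.linalg.norm(G_hat - G, 2) / np.linalg.norm(G, 2)
print(f"relative spectral-norm error: {err:.3f}")  # shrinks as n grows
```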

Statistical guarantees for random feature approximation

Consider the angular similarity example (3.2):
• To begin with, since $z \in \{\pm 1\}^N$ implies $\|z\|_2^2 = N$,
\[ \mathbb{E}[R_l^2] = \mathbb{E}\big[zz^\top zz^\top\big] = N\, \mathbb{E}[zz^\top] = N G
\quad \Longrightarrow \quad
v \le \Big\| \frac{1}{n^2} \sum_{l=1}^{n} \mathbb{E}[R_l^2] \Big\| = \frac{N}{n} \|G\| \]
• Next, each summand obeys
\[ \frac{1}{n} \|R_l\| = \frac{1}{n} \|z\|_2^2 = \frac{N}{n} =: B \]
• Applying the matrix Bernstein inequality yields: with high probability,
\[ \|\hat{G} - G\| \lesssim \sqrt{v \log N} + B \log N \asymp \sqrt{\frac{N \|G\| \log N}{n}} + \frac{N \log N}{n} \lesssim \sqrt{\frac{N \|G\| \log N}{n}} \]
where the last step holds for sufficiently large $n$ (so that the first term dominates).

Sample complexity

Define the intrinsic dimension of $G$ as
\[ \operatorname{intdim}(G) = \frac{\operatorname{tr} G}{\|G\|} = \frac{N}{\|G\|} \]
(using $\operatorname{tr} G = N$, since $G_{i,i} = \Phi(x_i, x_i) = 1$).

If $n \gtrsim \varepsilon^{-2} \operatorname{intdim}(G) \log N$, then we have
\[ \|\hat{G} - G\| \le \varepsilon \|G\| \]
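A small sketch computing this quantity for a sampled kernel matrix, assuming NumPy and continuing the angular-similarity example above (names are ours):

```python
import numpy as np

rng = np.random.default_rng(9)
N, d, eps = 200, 20, 0.5

X = rng.standard_normal((N, d))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
G = (2 / np.pi) * np.arcsin(np.clip(Xn @ Xn.T, -1.0, 1.0))

intdim = np.trace(G) / np.linalg.norm(G, 2)        # = N / ||G||, since tr G = N
n_needed = int(intdim * np.log(N) / eps**2)        # sample-complexity heuristic
print(f"intdim(G) = {intdim:.1f}, suggested n ~ {n_needed}")
```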

References

• "An introduction to matrix concentration inequalities," J. Tropp, Foundations and Trends in Machine Learning, 2015.
• "Convex trace functions and the Wigner-Yanase-Dyson conjecture," E. Lieb, Advances in Mathematics, 1973.
• "User-friendly tail bounds for sums of random matrices," J. Tropp, Foundations of Computational Mathematics, 2012.
• "Random features for large-scale kernel machines," A. Rahimi, B. Recht, Neural Information Processing Systems, 2008.
