
High-dimensional data: Some challenges and recent progress

Martin Wainwright
UC Berkeley, Departments of Statistics and EECS

Based on joint work with: Alekh Agarwal (UC Berkeley), John Lafferty (Univ. Chicago), Sahand Negahban (UC Berkeley), Pradeep Ravikumar (UT Austin)

August 2011




Era of massive data sets

science and engineering in the 21st century:
- rapid technological advances (sensors, storage, computing, etc.)
- tremendous amounts of data being collected

many examples:
- molecular biology: genomics, proteomics, etc.
- neuroscience: fMRI, PET, EEG, multi-electrode recording, etc.
- astronomy: Sloan Digital Sky Survey, Large Synoptic Survey Telescope, etc.
- consumer preference data: Netflix, Amazon, etc.
- geosciences: hyperspectral imaging
- financial data: stocks, bonds, currencies, derivatives, etc.

a wealth of data, yet a paucity of information

for statisticians: many exciting challenges and opportunities!

A story in three parts

1. Graphical models
   - Motivating applications: epidemiology, biology, social networks
   - Problem of model selection
   - Neighborhood-based discovery
2. Exploiting low-rank structure
   - Motivating applications: recommender systems and collaborative filtering
   - Nuclear norm as a rank surrogate
3. Matrix decomposition problems
   - Motivating applications: robust PCA, security issues, hidden variables
   - Sparse plus low-rank: a simple relaxation


Epidemiological networks

(a) Cholera epidemic (London, 1854)  (b) "Spoke-hub" network  (Snow, 1855)

- network structure associated with spread of disease
- useful diagnostic information: contaminated water from Broad Street pump


Biological networks

gene networks during Drosophila life cycle (Ahmed & Xing, PNAS, 2009)

many other examples:
- protein networks
- phylogenetic trees
- neural networks for brain-machine interfaces (e.g., Carmena et al., 2009)

Social networksMcCain ) Biblical characters(b) US senators (2004-2006)www.esv.org(Ravikumar, W. & Lafferty, 2006)Martin Wainwright (UC Berkeley)High-dimensional dataAugust 20116 / 34




Example: Voting and graphical models

Vote of person $s$:
$$x_s = \begin{cases} +1 & \text{if individual } s \text{ votes ``yes''} \\ -1 & \text{if individual } s \text{ votes ``no''} \end{cases}$$

(1) Independent voting:
$$P(x_1, \ldots, x_5) \propto \prod_{s=1}^{5} \exp(\theta_s x_s)$$

(2) Cycle-based voting:
$$P(x_1, \ldots, x_5) \propto \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{(s,t) \in C} \exp(\theta_{st} x_s x_t)$$

(3) Full clique voting:
$$P(x_1, \ldots, x_5) \propto \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{s \neq t} \exp(\theta_{st} x_s x_t)$$
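These voting models are small Ising models, so their probabilities can be computed by brute force. A minimal numpy sketch of model (2) on a 5-node cycle; the node and edge parameter values (zero biases, a common coupling of 0.5) are illustrative choices, not values from the slides:

```python
import numpy as np
from itertools import product

def unnorm_prob(x, theta_node, edges, theta_edge):
    """Unnormalized Ising probability:
    prod_s exp(theta_s x_s) * prod_{(s,t) in C} exp(theta_st x_s x_t)."""
    energy = float(np.dot(theta_node, x))
    energy += sum(theta_edge * x[s] * x[t] for s, t in edges)
    return np.exp(energy)

# 5-node cycle C = {(0,1), (1,2), (2,3), (3,4), (4,0)}
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
theta_node = np.zeros(5)   # no individual bias toward "yes" or "no"
theta_edge = 0.5           # positive coupling: neighbors tend to agree

# brute-force normalization over all 2^5 voting configurations
configs = [np.array(c) for c in product([-1, 1], repeat=5)]
Z = sum(unnorm_prob(x, theta_node, cycle, theta_edge) for x in configs)
p_all_yes = unnorm_prob(np.ones(5), theta_node, cycle, theta_edge) / Z
# with positive couplings, a unanimous "yes" vote is more likely than
# the uniform baseline of 1/32
```

Brute-force normalization is only feasible for tiny graphs; it is used here purely to make the model concrete.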

Possible voting patterns

Underlying graphs

Markov property and neighborhood structure

Markov properties encode neighborhood structure:
$$(X_s \mid X_{V \setminus s}) \stackrel{d}{=} (X_s \mid X_{N(s)})$$
where the left side conditions on the full graph and the right side conditions on the Markov blanket $N(s) = \{t_1, t_2, t_3, t_4, t_5\}$.

- basis of the pseudolikelihood method (Besag, 1974)
- used for Gaussian model selection (Meinshausen & Bühlmann, 2006)

Graph selection via neighborhood regression
(Ravikumar, Wainwright & Lafferty, 2006, 2010)

Key: graph recovery $\hat{G}$ is equivalent to recovering the neighborhood sets $N(s)$.

Method: based on $n$ samples:

1. For each node $s$, predict $X_s$ based on the other variables $X_{\setminus s}$:
$$\hat{\theta}[s] := \arg\min_{\theta \in \mathbb{R}^{p-1}} \Big\{ \underbrace{-\frac{1}{n} \sum_{i=1}^{n} \log P(\theta; X^{(i)}_{\setminus s})}_{\text{negative log likelihood}} \; + \; \lambda_n \underbrace{\sum_{t \in V \setminus \{s\}} |\theta_{st}|}_{\ell_1 \text{ regularization}} \Big\}$$
2. Estimate the local neighborhood $\hat{N}(s)$ by extracting the non-zero positions within $\hat{\theta}[s]$.
3. Combine the neighborhood estimates to form a graph estimate $\hat{G}$.
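Step 1 can be sketched in plain numpy as an $\ell_1$-regularized logistic regression solved by proximal gradient (iterative soft-thresholding). The synthetic data, step size, iteration count, and regularization level $\lambda_n$ below are illustrative choices, not the values used in the slides:

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def neighborhood_lasso_logistic(X, s, lam, step=0.1, iters=1000):
    """l1-regularized logistic regression of node s on the remaining variables."""
    n, p = X.shape
    y = X[:, s]                       # target votes in {-1, +1}
    Z = np.delete(X, s, axis=1)       # predictors X_{\s}
    theta = np.zeros(p - 1)
    for _ in range(iters):
        margins = y * (Z @ theta)
        # gradient of the mean logistic loss log(1 + exp(-y z^T theta))
        grad = -(Z * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

# toy data: node 0 drives node 1; the rest are independent coin flips
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(2000, 6))
flip = rng.choice([-1.0, 1.0], size=2000, p=[0.1, 0.9])
X[:, 1] = X[:, 0] * flip              # X1 copies X0, flipped 10% of the time
theta_hat = neighborhood_lasso_logistic(X, s=1, lam=0.05)
# after deleting column 1, index 0 of theta_hat corresponds to node 0,
# which should carry the dominant weight
```

Extracting the non-zero positions of `theta_hat` then gives the estimated neighborhood of node 1 (steps 2 and 3 of the method).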

Empirical behavior: Unrescaled plots

[Plot: probability of success vs. number of samples (0 to 600), star graph with a linear fraction of neighbors; curves for p = 64, 100, 225.]

Empirical behavior: Appropriately rescaled

[Plot: probability of success vs. control parameter (0 to 2), star graph with a linear fraction of neighbors; curves for p = 64, 100, 225.]

Illustration: Social network of US senators

originally studied by Banerjee, d'Aspremont and El Ghaoui (2008)

discrete data set of voting records for $p = 100$ senators:
$$X_{ij} = \begin{cases} +1 & \text{if senator } i \text{ voted yes on bill } j \\ -1 & \text{otherwise.} \end{cases}$$

full data matrix $X \in \mathbb{R}^{n \times p}$ with $n = 542$:
$$X = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ X_{31} & X_{32} & \cdots & X_{3p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix}$$

Estimated senator network (subgraph of 55)

§2. (Nearly) low-rank matrices

Matrix $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$ with rank $r \ll \min\{d_1, d_2\}$.

Singular value decomposition $\Theta^* = U D V^T$:
- matrix of left singular vectors $U \in \mathbb{R}^{d_1 \times r}$
- matrix of right singular vectors $V \in \mathbb{R}^{d_2 \times r}$
- singular values $\sigma_1(\Theta^*) \ge \sigma_2(\Theta^*) \ge \cdots \ge \sigma_r(\Theta^*) > 0$

Example: Matrix completion

[Figure: partially observed ratings matrix with a few known entries.]

- universe of $d_1$ individuals and $d_2$ films
- observe $n \ll d_1 d_2$ ratings
- typical numbers for Netflix: $d_1 \approx 10^5$ to $10^8$ and $d_2 \approx 10^6$ to $10^{10}$

Geometry of low-rank model

[Figure: films such as Jaws, Annie Hall, and 400 Blows embedded in a low-dimensional space of latent attributes (scary, funny, sad, soothing).]



Nuclear norm as a rank surrogate

Rank as an $\ell_0$-"norm" on the vector of singular values:
$$\operatorname{rank}(\Theta^*) = \sum_{j=1}^{d} \mathbb{I}\big[\sigma_j(\Theta^*) \neq 0\big], \quad \text{where } d = \min\{d_1, d_2\}.$$

Non-convexity: rank constraints are computationally hard.

Nuclear norm: convex relaxation of rank:
$$\|\Theta\|_{\mathrm{nuc}} = \sum_{j=1}^{d} \sigma_j(\Theta).$$

Estimator for matrix completion:
$$\hat{\Theta} \in \arg\min_{\Theta \in \mathbb{R}^{d_1 \times d_2}} \sum_{(a,b) \in \Omega} \big(Y_{ab} - \Theta_{ab}\big)^2 + \lambda_n \|\Theta\|_{\mathrm{nuc}}$$
(Fazel, 2001; Srebro et al., 2004; Candès & Recht, 2009; Negahban & Wainwright, 2010)
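Both quantities can be read off the singular values of the same matrix. A small numpy check; the rank-1 example matrix is illustrative:

```python
import numpy as np

def rank_and_nuclear_norm(theta, tol=1e-10):
    """Rank = number of non-zero singular values; nuclear norm = their sum."""
    sigma = np.linalg.svd(theta, compute_uv=False)
    return int(np.sum(sigma > tol)), float(np.sum(sigma))

# rank-1 example: the outer product u v^T has a single
# non-zero singular value equal to ||u|| * ||v||
u = np.array([3.0, 0.0, 4.0])      # ||u|| = 5
v = np.array([0.0, 2.0])           # ||v|| = 2
r, nuc = rank_and_nuclear_norm(np.outer(u, v))
# r == 1 and nuc == 10.0
```

This makes the relaxation concrete: the rank counts the singular values, while the nuclear norm sums them, exactly as the $\ell_0$ "norm" relates to the $\ell_1$ norm for vectors.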

Noisy matrix completion (unrescaled)

[Plot: Frobenius-norm MSE vs. raw sample size (0 to 6000), q = 0, for several matrix dimensions (d = 40, 60, 80, 100).]

Noisy matrix completion (rescaled)

[Plot: Frobenius-norm MSE vs. rescaled sample size (0 to 4), q = 0, for the same dimensions (d = 40, 60, 80, 100).]





A simple iterative algorithm

Projected gradient descent over the nuclear norm ball with stepsize $\alpha > 0$:

1. Compute the gradient at the current iterate $\Theta^t$:
$$[\nabla L(\Theta^t)]_{ab} = \begin{cases} \Theta^t_{ab} - Y_{ab} & \text{if entry } (a,b) \text{ observed} \\ 0 & \text{otherwise.} \end{cases}$$
2. Compute the singular value decomposition of the matrix $\Gamma = \Theta^t - \alpha \nabla L(\Theta^t)$.
3. Return $\Theta^{t+1}$ by soft-thresholding the singular values of $\Gamma$ at level $\lambda_n$.

(Implemented by Mazumder, Hastie & Tibshirani, 2009)

Question: How quickly does this algorithm converge?

Without additional structure, one would expect slow (sub-linear) convergence:
$$\|\Theta^t - \hat{\Theta}\|_F^2 \lesssim \frac{1}{t}.$$
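The three steps above can be sketched directly in numpy. The stepsize, threshold level, and the toy rank-1 completion problem below are illustrative choices:

```python
import numpy as np

def svt_step(theta, Y, mask, alpha, lam):
    """One iteration: gradient step on the observed entries, then
    soft-threshold the singular values at level lam."""
    grad = np.where(mask, theta - Y, 0.0)                        # step 1
    U, s, Vt = np.linalg.svd(theta - alpha * grad,
                             full_matrices=False)                # step 2
    s = np.maximum(s - lam, 0.0)                                 # step 3
    return (U * s) @ Vt

# toy completion problem: rank-1 truth, 7 of 9 entries observed
Y_true = np.outer([1.0, 2.0, 3.0], [1.0, 1.0, 2.0])
mask = np.array([[1, 1, 0], [1, 0, 1], [1, 1, 1]], dtype=bool)
theta = np.zeros_like(Y_true)
for _ in range(200):
    theta = svt_step(theta, Y_true, mask, alpha=1.0, lam=0.05)
err_obs = np.linalg.norm(np.where(mask, theta - Y_true, 0.0))
# the fit on observed entries shrinks far below the initial error ||Y_obs||_F
```

Each iteration costs one SVD, which is the dominant expense; the convergence-rate question posed above asks how many such iterations are needed.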

Sub-linear versus linear convergence

[Plot: log optimization error vs. number of iterations (0 to 60), comparing a sub-linear rate with a linear (geometric) rate.]

Fast convergence rates for matrix completion

[Plot: rescaled log error $\log \|\Theta^t - \hat{\Theta}\|$ vs. iteration count (0 to 100), q = 0, $d^2 = 40000$, showing rapid geometric convergence.]


§3. Matrix decomposition: Low-rank plus sparse

Matrix $Y$ can be (approximately) decomposed into a sum:
$$Y = \underbrace{\Theta^*}_{\text{low-rank component}} + \underbrace{\Gamma^*}_{\text{sparse component}}$$

- exact decomposition: initially studied by Chandrasekaran et al., 2009
- various applications: robust collaborative filtering; graphical model selection with hidden variables; image/video segmentation


Matrix decomposition: Low-rank plus column-sparse

Matrix $Y$ can be (approximately) decomposed into a sum:
$$Y = \underbrace{\Theta^*}_{\text{low-rank component}} + \underbrace{\Gamma^*}_{\text{column-sparse component}}$$

- exact decomposition: initially studied by Xu et al., 2010
- various applications: robust collaborative filtering; robust principal components analysis

Example: Collaborative filtering

[Figure: partially observed ratings matrix.]

- universe of $d_1$ individuals and $d_2$ films
- observe $n \ll d_1 d_2$ ratings

(e.g., Srebro, Alon & Jaakkola, 2004)


Security and robustness issues

[Figure: a spiritual guide recommended alongside a sex manual.]

Break-down of the Amazon recommendation system (New York Times, 2002).

Example: Robustness in PCA

Standard PCA fits a low-rank matrix to a data matrix.

Example: Robustness in PCA

A small amount of data corruption can have a large influence.

Example: Structure of Gauss-Markov random fields

[Figure: 5-node graph and the zero pattern of its inverse covariance.]

Multivariate Gaussian with graph-structured inverse covariance $\Gamma$:
$$P(x_1, x_2, \ldots, x_p) \propto \exp\Big(-\frac{1}{2} x^T \Gamma x\Big).$$
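Zeros in the inverse covariance correspond to missing edges, i.e. conditional independencies. A small numpy check on an illustrative chain-structured precision matrix (the particular entries are arbitrary): the conditional covariance of two non-neighbors given the rest has zero off-diagonal, even though their marginal covariance is non-zero.

```python
import numpy as np

# chain graph 0 - 1 - 2: tridiagonal precision matrix, Gamma[0, 2] = 0
Gamma = np.array([[ 2.0, -0.5,  0.0],
                  [-0.5,  2.0, -0.5],
                  [ 0.0, -0.5,  2.0]])
Sigma = np.linalg.inv(Gamma)   # the (fully dense) covariance matrix

# conditional covariance of (x0, x2) given x1, via the Schur complement
A = Sigma[np.ix_([0, 2], [0, 2])]
B = Sigma[np.ix_([0, 2], [1])]
C = Sigma[np.ix_([1], [1])]
cond_cov = A - B @ np.linalg.inv(C) @ B.T
# the off-diagonal of cond_cov vanishes because Gamma[0, 2] = 0:
# x0 and x2 are conditionally independent given x1
```

This is exactly why graph selection for Gaussian models reduces to finding the zero pattern of the inverse covariance.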


Gauss-Markov models with hidden variables

[Figure: hidden variable z connected to observed variables x1, x2, x3, x4.]

Problems with hidden variables: conditioned on the hidden $z$, the vector $x = (x_1, x_2, x_3, x_4)$ is Gauss-Markov.

Inverse covariance of $x$ satisfies a {sparse, low-rank} decomposition:
$$\begin{pmatrix} 1 & \mu & \mu & \mu \\ \mu & 1 & \mu & \mu \\ \mu & \mu & 1 & \mu \\ \mu & \mu & \mu & 1 \end{pmatrix} = (1 - \mu)\, I_{4 \times 4} + \mu \,\mathbf{1}\mathbf{1}^T$$
(Chandrasekaran, Parrilo & Willsky, 2010)
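A quick numerical check of this sparse-plus-low-rank identity, with the arbitrary illustrative value µ = 0.3:

```python
import numpy as np

mu = 0.3
M = np.full((4, 4), mu)
np.fill_diagonal(M, 1.0)               # ones on the diagonal, mu elsewhere

ones = np.ones((4, 1))
sparse_part = (1 - mu) * np.eye(4)     # diagonal, hence sparse
low_rank_part = mu * (ones @ ones.T)   # rank-one: mu * 11^T
recon = sparse_part + low_rank_part    # equals M entrywise
```

The diagonal term is the "sparse" piece and the rank-one term is the contribution of the single hidden variable.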


Method for noisy matrix decomposition

Given noisy observations:
$$Y = \Theta^* + \Gamma^* + W$$

Solve the convex program
$$(\hat{\Theta}, \hat{\Gamma}) \in \arg\min_{(\Theta, \Gamma)} \|Y - (\Theta + \Gamma)\|_{\mathrm{frob}}^2 + \lambda_d \|\Theta\|_{\mathrm{nuc}} + \mu_d \|\Gamma\|_1$$
plus a "spikiness" constraint $\|\Theta\|_\infty \le \dfrac{\alpha_d}{\sqrt{d_1 d_2}}$.
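Ignoring the spikiness constraint, this convex program can be attacked with proximal gradient steps that alternate singular-value soft-thresholding (for the nuclear norm on Θ) with entrywise soft-thresholding (for the ℓ1 norm on Γ). A minimal sketch; the step size, regularization levels, and the toy rank-1-plus-spike data are illustrative choices:

```python
import numpy as np

def soft(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def decomposition_step(theta, gamma, Y, alpha, lam_d, mu_d):
    """One proximal-gradient sweep on ||Y - (Theta + Gamma)||_F^2
    with a nuclear-norm penalty on Theta and an l1 penalty on Gamma."""
    resid = theta + gamma - Y
    # prox for the nuclear norm: soft-threshold the singular values
    U, s, Vt = np.linalg.svd(theta - alpha * resid, full_matrices=False)
    theta_new = (U * soft(s, alpha * lam_d)) @ Vt
    # prox for the l1 norm: soft-threshold the entries
    gamma_new = soft(gamma - alpha * resid, alpha * mu_d)
    return theta_new, gamma_new

# toy data: rank-1 signal plus one large sparse corruption
rng = np.random.default_rng(1)
low_rank = np.outer(rng.normal(size=8), rng.normal(size=8))
sparse = np.zeros((8, 8))
sparse[2, 5] = 10.0
Y = low_rank + sparse
theta, gamma = np.zeros_like(Y), np.zeros_like(Y)
for _ in range(500):
    theta, gamma = decomposition_step(theta, gamma, Y, 0.2, 1.0, 0.05)
fit_err = np.linalg.norm(Y - (theta + gamma))
# gamma should absorb most of the corruption at entry (2, 5)
```

With the nuclear-norm penalty set much larger than the ℓ1 penalty, the single corrupted entry is cheaper to explain through the sparse component than through the low-rank one, which is the mechanism behind robust PCA.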

Illustration

[Figure: original observations.]

Illustration

[Figure: the observations decomposed into a low-rank component, a sparse component, and a noise matrix W.]


Summary

characteristics of modern data sets:
- large-scale: many samples, many predictors
- high-dimensional: data dimension may exceed sample size

challenges and opportunities for statisticians:
- how to model low-dimensional structure?
- new theory: non-asymptotic, allowing for high-dimensional scaling
- closer coupling between statistical and computational concerns

Some references:
- High-dimensional Ising model selection using ℓ1-regularized logistic regression (2010). Annals of Statistics, 38(3): 1287–1317. With P. Ravikumar and J. Lafferty.
- Estimation of (near) low-rank matrices with noise and high-dimensional scaling (2011). Annals of Statistics, 39(2): 1069–1097. With S. Negahban.
- Restricted strong convexity and (weighted) matrix completion: Optimal bounds with noise. arxiv.org/abs/0112.5100, September 2010. With S. Negahban.
- Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. http://arxiv.org/abs/1102.4807, February 2011. With A. Agarwal and S. Negahban.

