High-dimensional Data: Some Challenges and Recent Progress


High-dimensional data: Some challenges and recent progress
Martin Wainwright, UC Berkeley, Departments of Statistics and EECS
Based on joint work with: Alekh Agarwal (UC Berkeley), John Lafferty (Univ. Chicago), Sahand Negahban (UC Berkeley), Pradeep Ravikumar (UT Austin)
August 2011

Era of massive data sets
Science and engineering in the 21st century: rapid technological advances (sensors, storage, computing, etc.) mean that tremendous amounts of data are being collected.
Many examples:
molecular biology: genomics, proteomics, etc.
neuroscience: fMRI, PET, EEG, multi-electrode recording, etc.
astronomy: Sloan Digital Sky Survey, Large Synoptic Survey Telescope, etc.
consumer preference data: Netflix, Amazon, etc.
geosciences: hyperspectral imaging
financial data: stocks, bonds, currencies, derivatives, etc.
A wealth of data, yet a paucity of information. For statisticians: many exciting challenges and opportunities!

A story in three parts
1. Graphical models. Motivating applications: epidemiology, biology, social networks. Problem of model selection; neighborhood-based discovery.
2. Exploiting low-rank structure. Motivating applications: recommender systems and collaborative filtering. Nuclear norm as a rank surrogate.
3. Matrix decomposition problems. Motivating applications: robust PCA, security issues, hidden variables. Sparse plus low-rank: a simple relaxation.

Epidemiological networks
(Figures: (a) cholera epidemic, London, 1854 (Snow, 1855); (b) "spoke-hub" network.)
Network structure is associated with the spread of disease.
Useful diagnostic information: contaminated water from the Broad Street pump.

Biological networks
Gene networks during the Drosophila life cycle (Ahmed & Xing, PNAS, 2009).
Many other examples: protein networks, phylogenetic trees, neural networks for brain-machine interfaces (e.g., Carmena et al., 2009).

Social networks
(Figures: (a) Biblical characters, www.esv.org; (b) US senators, 2004-2006 (Ravikumar, W. & Lafferty, 2006).)

Example: Voting and graphical models
Vote of person s: $x_s = +1$ if individual s votes "yes", and $x_s = -1$ if individual s votes "no".
(1) Independent voting: $P(x_1, \dots, x_5) \propto \prod_{s=1}^{5} \exp(\theta_s x_s)$.
(2) Cycle-based voting: $P(x_1, \dots, x_5) \propto \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{(s,t) \in C} \exp(\theta_{st} x_s x_t)$.
(3) Full clique voting: $P(x_1, \dots, x_5) \propto \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{s \neq t} \exp(\theta_{st} x_s x_t)$.
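To make the three factorizations concrete, here is a minimal Python/numpy sketch (the parameter values $\theta_s$, $\theta_{st}$, the chosen edge sets, and the probe configuration are illustrative assumptions, not taken from the slides) that evaluates each voting model by brute-force normalization over all $2^5$ configurations:

```python
import itertools
import numpy as np

def unnormalized_prob(x, theta, theta_pair, edges):
    """exp( sum_s theta_s x_s + sum_{(s,t) in edges} theta_st x_s x_t )."""
    energy = float(np.dot(theta, x))
    energy += sum(theta_pair[s, t] * x[s] * x[t] for (s, t) in edges)
    return np.exp(energy)

p = 5
theta = 0.1 * np.ones(p)                 # node parameters (illustrative values)
theta_pair = 0.5 * np.ones((p, p))       # edge parameters (illustrative values)

independent = []                                              # model (1): no edges
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]              # model (2): 5-cycle C
clique = [(s, t) for s in range(p) for t in range(s + 1, p)]  # model (3): full clique

# Normalize by brute force over all 2^5 vote configurations x in {-1,+1}^5.
configs = [np.array(c) for c in itertools.product([-1, 1], repeat=p)]
for edges in (independent, cycle, clique):
    Z = sum(unnormalized_prob(x, theta, theta_pair, edges) for x in configs)
    x = np.array([1, 1, 1, -1, -1])      # one particular voting pattern
    print(unnormalized_prob(x, theta, theta_pair, edges) / Z)
```

Brute-force normalization is only feasible for toy sizes like this five-node example; it is used here purely to show how the edge set changes the joint distribution.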

Possible voting patterns

Underlying graphs

Markov property and neighborhood structure
Markov properties encode neighborhood structure:
$$(X_s \mid X_{V \setminus s}) \;\stackrel{d}{=}\; (X_s \mid X_{N(s)}),$$
i.e., conditioning on the full graph is equivalent to conditioning on the Markov blanket $N(s) = \{t_1, t_2, t_3, t_4, t_5\}$.
This is the basis of pseudolikelihood methods (Besag, 1974), and is used for Gaussian model selection (Meinshausen & Buhlmann, 2006).

Graph selection via neighborhood regression
(Ravikumar, Wainwright & Lafferty, 2006, 2010)
Key: recovering the graph G is equivalent to recovering the neighborhood sets N(s).
Method, based on n samples:
1. For each node s, predict $X_s$ from the other variables $X_{\setminus s}$:
$$\hat{\theta}[s] := \arg\min_{\theta \in \mathbb{R}^{p-1}} \Big\{ \underbrace{-\frac{1}{n} \sum_{i=1}^{n} \log P_\theta\big(x_s^{(i)} \mid x_{\setminus s}^{(i)}\big)}_{\text{negative log-likelihood}} \;+\; \lambda_n \underbrace{\sum_{t \in V \setminus \{s\}} |\theta_{st}|}_{\ell_1 \text{ regularization}} \Big\}$$
2. Estimate the local neighborhood $\hat{N}(s)$ by extracting the non-zero positions within $\hat{\theta}[s]$.
3. Combine the neighborhood estimates to form a graph estimate $\hat{G}$.
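A hedged sketch of this recipe for binary data, using scikit-learn's $\ell_1$-penalized logistic regression as the per-node estimator; the regularization weight, the numerical threshold for "non-zero", and the OR rule used to symmetrize the neighborhoods are illustrative assumptions rather than prescriptions from the slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_graph(X, lam, tol=1e-6):
    """Neighborhood selection for binary data X (n x p, entries in {-1, +1}).

    lam: l1 regularization weight lambda_n (must be tuned in practice).
    Returns a symmetric boolean adjacency-matrix estimate.
    """
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for s in range(p):
        others = [t for t in range(p) if t != s]
        # Step 1: l1-regularized logistic regression of X_s on X_{\s};
        # scikit-learn's C is the inverse of the regularization strength.
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / (n * lam))
        clf.fit(X[:, others], X[:, s])
        # Step 2: the estimated neighborhood is the support of theta-hat[s].
        nbrs = np.array(others)[np.abs(clf.coef_.ravel()) > tol]
        adj[s, nbrs] = True
    # Step 3: combine the p neighborhood estimates (here with the "OR" rule).
    return adj | adj.T
```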

Empirical behavior: Unrescaled plots
(Figure: star graph, linear fraction of neighbors; probability of success versus number of samples, for p = 64, 100, 225.)

Empirical behavior: Appropriately rescaled
(Figure: star graph, linear fraction of neighbors; probability of success versus the rescaled control parameter, for p = 64, 100, 225.)

Illustration: Social network of US senators
Originally studied by Banerjee, d'Aspremont and El Ghaoui (2008).
Discrete data set of voting records for p = 100 senators:
$$X_{ij} = \begin{cases} +1 & \text{if senator } i \text{ voted yes on bill } j \\ -1 & \text{otherwise.} \end{cases}$$
Full data matrix $X = (X_{ij}) \in \mathbb{R}^{n \times p}$ with $n = 542$.

Estimated senator network (subgraph of 55)

§2. (Nearly) low-rank matrices
Matrix $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$ with rank $r \ll \min\{d_1, d_2\}$.
Singular value decomposition $\Theta^* = U D V^T$:
matrix of left singular vectors $U \in \mathbb{R}^{d_1 \times r}$
matrix of right singular vectors $V \in \mathbb{R}^{d_2 \times r}$
singular values $\sigma_1(\Theta^*) \geq \sigma_2(\Theta^*) \geq \cdots \geq \sigma_r(\Theta^*) > 0$.

Example: Matrix completion
(Figure: partially observed ratings matrix, most entries missing.)
Universe of $d_1$ individuals and $d_2$ films; observe $n \ll d_1 d_2$ ratings.
Typical numbers for Netflix: $d_1 \approx 10^5$–$10^8$ and $d_2 \approx 10^6$–$10^{10}$.

Geometry of low-rank model
(Figure: films such as Jaws, Annie Hall, and 400 Blows embedded along latent attribute axes: scary, funny, sad, soothing.)

Nuclear norm as a rank surrogate
Rank as an $\ell_0$-"norm" on the vector of singular values:
$$\operatorname{rank}(\Theta^*) = \sum_{j=1}^{d} \mathbb{I}\big[\sigma_j(\Theta^*) \neq 0\big], \quad \text{where } d = \min\{d_1, d_2\}.$$
Non-convexity: rank constraints are computationally hard.
Nuclear norm: convex relaxation of the rank:
$$\|\Theta\|_{\mathrm{nuc}} = \sum_{j=1}^{d} \sigma_j(\Theta).$$
Estimator for matrix completion:
$$\hat{\Theta} = \arg\min_{\Theta \in \mathbb{R}^{d_1 \times d_2}} \sum_{(a,b) \in \Omega} \big(Y_{ab} - \Theta_{ab}\big)^2 + \lambda_n \|\Theta\|_{\mathrm{nuc}}$$
(Fazel, 2001; Srebro et al., 2004; Candes & Recht, 2009; Negahban & Wainwright, 2010)
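As a quick numerical illustration of the rank / nuclear-norm relationship (the dimensions and the random low-rank construction below are arbitrary choices), both quantities can be read off a single SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 50, 40, 5
Theta = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))  # rank-r matrix

sigma = np.linalg.svd(Theta, compute_uv=False)   # singular values, sorted descending
rank = int(np.sum(sigma > 1e-10))                # l0-"norm" of the singular-value vector
nuclear_norm = float(sigma.sum())                # convex surrogate: sum of singular values
print(rank, nuclear_norm)                        # rank is 5 (with probability one)
```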

Noisy matrix completion (unrescaled)
(Figure: Frobenius-norm MSE versus raw sample size, q = 0, for $d^2 = 40^2, 60^2, 80^2, 100^2$.)

Noisy matrix completion (rescaled)
(Figure: Frobenius-norm MSE versus rescaled sample size, q = 0, for $d^2 = 40^2, 60^2, 80^2, 100^2$.)

A simple iterative algorithm
Projected gradient descent over the nuclear norm ball, with stepsize $\alpha > 0$:
1. Compute the gradient at the current iterate $\Theta^t$:
$$[\nabla L(\Theta^t)]_{ab} = \begin{cases} \Theta^t_{ab} - Y_{ab} & \text{if entry } (a,b) \text{ is observed} \\ 0 & \text{otherwise.} \end{cases}$$
2. Compute the singular value decomposition of the matrix $\Gamma = \Theta^t - \alpha \nabla L(\Theta^t)$.
3. Return $\Theta^{t+1}$ by soft-thresholding the singular values of $\Gamma$ at level $\lambda_n$.
(Implemented by Mazumder, Hastie & Tibshirani, 2009.)
Question: how quickly does this algorithm converge?
Without additional structure, one would expect slow (sub-linear) convergence:
$$\|\Theta^t - \hat{\Theta}\|_F^2 \;\lesssim\; \frac{1}{t}.$$
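The three steps above translate almost line-for-line into numpy; the sketch below assumes squared loss on the observed entries, and the function name, fixed stepsize, and untuned iteration count are my own illustrative choices, not taken from the slides:

```python
import numpy as np

def matrix_completion(Y, mask, lam, alpha=1.0, n_iters=100):
    """Iterate steps 1-3: gradient step on observed entries, then
    singular-value soft-thresholding at level lam.

    Y: d1 x d2 matrix of ratings (arbitrary values where unobserved).
    mask: boolean d1 x d2 matrix marking the observed entries.
    """
    Theta = np.zeros_like(Y, dtype=float)
    for _ in range(n_iters):
        # Step 1: gradient of the squared loss, nonzero only on observed entries.
        grad = np.where(mask, Theta - Y, 0.0)
        # Step 2: SVD of the gradient-step matrix Gamma.
        Gamma = Theta - alpha * grad
        U, sigma, Vt = np.linalg.svd(Gamma, full_matrices=False)
        # Step 3: soft-threshold the singular values and reassemble Theta^{t+1}.
        sigma = np.maximum(sigma - lam, 0.0)
        Theta = (U * sigma) @ Vt
    return Theta
```

Each iteration costs one SVD; the convergence plots that follow contrast the generic $1/t$ rate with the much faster, geometric convergence observed for matrix completion.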

Sub-linear versus linear convergence
(Figure: log optimization error versus number of iterations, comparing sub-linear and linear convergence.)

Fast convergence rates for matrix completion
(Figure: $\log\|\Theta^t - \hat{\Theta}\|$ (rescaled) versus iteration count, q = 0, $d^2 = 40000$.)

§3. Matrix decomposition: Low-rank plus sparse
The matrix Y can be (approximately) decomposed into a sum:
$$Y = \underbrace{\Theta^*}_{\text{low-rank component}} + \underbrace{\Gamma^*}_{\text{sparse component}}$$
Exact decomposition: initially studied by Chandrasekaran et al., 2009.
Various applications: robust collaborative filtering, graphical model selection with hidden variables, image/video segmentation.

Matrix decomposition: Low-rank plus column sparse
The matrix Y can be (approximately) decomposed into a sum:
$$Y = \underbrace{\Theta^*}_{\text{low-rank component}} + \underbrace{\Gamma^*}_{\text{column-sparse component}}$$
Exact decomposition: initially studied by Xu et al., 2010.
Various applications: robust collaborative filtering, robust principal components analysis.

Example: Collaborative filtering
(Figure: partially observed ratings matrix, most entries missing.)
Universe of $d_1$ individuals and $d_2$ films; observe $n \ll d_1 d_2$ ratings.
(e.g., Srebro, Alon & Jaakkola, 2004)

Security and robustness issues
(Figure: a spiritual guide recommended alongside a sex manual.)
Break-down of the Amazon recommendation system (New York Times, 2002).

Example: Robustness in PCA
Standard PCA fits a low-rank matrix to a data matrix.
A small amount of data corruption can have a large influence.

Example: Structure of Gauss-Markov random fields
(Figure: zero pattern of the inverse covariance, matched to a graph on nodes 1-5.)
Multivariate Gaussian with graph-structured inverse covariance $\Gamma^*$:
$$P(x_1, x_2, \dots, x_p) \propto \exp\Big(-\tfrac{1}{2}\, x^T \Gamma^* x\Big).$$

Gauss-Markov models with hidden variables
(Figure: hidden variable z connected to observed variables x1, x2, x3, x4.)
Problems with hidden variables: conditioned on the hidden z, the vector $x = (x_1, x_2, x_3, x_4)$ is Gauss-Markov.
The inverse covariance of x satisfies a {sparse, low-rank} decomposition:
$$\begin{pmatrix} 1 & \mu & \mu & \mu \\ \mu & 1 & \mu & \mu \\ \mu & \mu & 1 & \mu \\ \mu & \mu & \mu & 1 \end{pmatrix} = (1-\mu)\, I_{4 \times 4} + \mu\, \mathbf{1}\mathbf{1}^T.$$
(Chandrasekaran, Parrilo & Willsky, 2010)
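A small numerical check of this phenomenon (the unit factor loadings and unit noise variance below are assumptions chosen only to make the example concrete): marginalizing out a single hidden factor perturbs an otherwise diagonal inverse covariance by a rank-one term.

```python
import numpy as np

d = 4
load = np.ones(d)                  # loadings of x_1..x_4 on the hidden z (assumed all 1)
Sigma_x = np.eye(d) + np.outer(load, load)   # marginal covariance of x after integrating out z
Prec_x = np.linalg.inv(Sigma_x)              # marginal inverse covariance (dense)

# Sherman-Morrison: Prec_x = I - (1/(1+d)) * 11^T, i.e. sparse (diagonal) plus rank-one.
low_rank_part = Prec_x - np.eye(d)
print(np.linalg.matrix_rank(low_rank_part))                          # 1: rank-one perturbation
print(np.allclose(low_rank_part, -np.outer(load, load) / (1 + d)))   # True
```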

Method for noisy matrix decomposition
Given noisy observations $Y = \Theta^* + \Gamma^* + W$, solve the convex program
$$(\hat{\Theta}, \hat{\Gamma}) = \arg\min_{(\Theta, \Gamma)} \|Y - (\Theta + \Gamma)\|_F^2 + \lambda_d \|\Theta\|_{\mathrm{nuc}} + \mu_d \|\Gamma\|_1,$$
plus a "spikiness" constraint $\|\Theta\|_\infty \leq \dfrac{\alpha_d}{\sqrt{d_1 d_2}}$.
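One simple way to solve a program of this form approximately is proximal gradient descent: alternate a gradient step on the quadratic loss with singular-value soft-thresholding for the nuclear-norm term and entrywise soft-thresholding for the $\ell_1$ term. The sketch below, including the step size, iteration count, and enforcement of the spikiness constraint by clipping, is an illustrative assumption, not the solver used in the paper:

```python
import numpy as np

def soft_threshold(M, tau):
    """Entrywise soft-thresholding: the proximal operator of tau * ||.||_1."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def decompose(Y, lam, mu, alpha_spike=np.inf, n_iters=200, step=0.25):
    """Minimize ||Y - Theta - Gamma||_F^2 + lam*||Theta||_nuc + mu*||Gamma||_1,
    with the spikiness bound |Theta_ab| <= alpha_spike enforced by clipping."""
    Theta = np.zeros_like(Y, dtype=float)
    Gamma = np.zeros_like(Y, dtype=float)
    for _ in range(n_iters):
        grad = 2.0 * (Theta + Gamma - Y)   # gradient of the quadratic loss in both blocks
        # Low-rank block: gradient step, then singular-value soft-thresholding.
        U, s, Vt = np.linalg.svd(Theta - step * grad, full_matrices=False)
        Theta = (U * np.maximum(s - step * lam, 0.0)) @ Vt
        Theta = np.clip(Theta, -alpha_spike, alpha_spike)
        # Sparse block: gradient step, then entrywise soft-thresholding.
        Gamma = soft_threshold(Gamma - step * grad, step * mu)
    return Theta, Gamma
```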

Illustration
(Figures: original observations; estimated low-rank component; estimated sparse component; noise matrix W.)

Summary
Characteristics of modern data sets:
large-scale: many samples, many predictors
high-dimensional: data dimension may exceed the sample size
Challenges and opportunities for statisticians:
how to model low-dimensional structure?
new theory: non-asymptotic, allowing for high-dimensional scaling
closer coupling between statistical and computational concerns
Some references:
High-dimensional Ising model selection using $\ell_1$-regularized logistic regression (2010). Annals of Statistics, 38(3): 1287-1317. With P. Ravikumar and J. Lafferty.
Estimation rates of (near) low-rank matrices with noise and high-dimensional scaling (2011). Annals of Statistics, 39(2): 1069-1097. With S. Negahban.
Restricted strong convexity and (weighted) matrix completion: Optimal bounds with noise. arxiv.org/abs/0112.5100, September 2010. With S. Negahban.
Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. http://arxiv.org/abs/1102.4807, February 2011. With A. Agarwal and S. Negahban.
