Detecting Novel Associations In Large Data Sets David N .

2y ago
20 Views
2 Downloads
2.26 MB
8 Pages
Last View : 23d ago
Last Download : 2m ago
Upload by : Ronan Orellana
Transcription

Detecting Novel Associations in Large Data SetsDavid N. Reshef, et al.Science 334, 1518 (2011);DOI: 10.1126/science.1205438This copy is for your personal, non-commercial use only.If you wish to distribute this article to others, you can order high-quality copies for yourcolleagues, clients, or customers by clicking here.The following resources related to this article are available online atwww.sciencemag.org (this infomation is current as of December 15, 2011 ):Updated information and services, including high-resolution figures, can be found in the onlineversion of this article .full.htmlSupporting Online Material can be found his article cites 35 articles, 6 of which can be accessed 18.full.html#ref-list-1This article has been cited by 1 articles hosted by HighWire Press; 8.full.html#related-urlsScience (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week in December, by theAmerican Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. Copyright2011 by the American Association for the Advancement of Science; all rights reserved. The title Science is aregistered trademark of AAAS.Downloaded from www.sciencemag.org on December 16, 2011Permission to republish or repurpose articles or portions of articles can be obtained byfollowing the guidelines here.

RESEARCH ARTICLESDepartment of Computer Science, Massachusetts Institute ofTechnology (MIT), Cambridge, MA 02139, USA. 2Broad Instituteof MIT and Harvard, Cambridge, MA 02142, USA. 3Departmentof Statistics, University of Oxford, Oxford OX1 3TG, UK. 4Department of Mathematics, Harvard College, Cambridge, MA02138, USA. 5Department of Computer Science and AppliedMathematics, Weizmann Institute of Science, Rehovot, Israel.6Center for Systems Biology, Department of Organismic andEvolutionary Biology, Harvard University, Cambridge, MA 02138,USA. 7Wellcome Trust Centre for Human Genetics, University ofOxford, Oxford OX3 7BN, UK. 8Department of Biology, MIT,Cambridge, MA 02139, USA. 9Department of Systems Biology,Harvard Medical School, Boston, MA 02115, USA. 10School ofEngineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA.*These authors contributed equally to this work.†To whom correspondence should be addressed. E-mail:dnreshef@mit.edu (D.N.R.); yreshef@post.harvard.edu (Y.A.R.)‡These authors contributed equally to this work.151816 DECEMBER 2011VOL ertical30BinsFig. 1. Computing MIC (A) For each pair (x,y), theMIC algorithm finds the x-by-y grid with the highestinduced mutual information. (B) The algorithmnormalizes the mutual information scores andcompiles a matrix that stores, for each resolution,the best grid at that resolution and its normalizedscore. (C) The normalized scores form the characteristic matrix, which can be visualized as a surface; MIC corresponds to the highest point on thissurface. In this example, there are many grids thatachieve the highest score. The star in (B) marks asample grid achieving this score, and the star in (C)marks that grid’s corresponding location on thesurface.www.sciencemag.orgDownloaded from www.sciencemag.org on December 16, 20112x3insis BAxtalonrizHo1not only do relationships take many functionalforms, but many important relationships—for example, a superposition of functions—are not wellmodeled by a function (4–7).By equitability, we mean that the statisticshould give similar scores to equally noisy relationships of different types. For example, we donot want noisy linear relationships to drive strongsinusoidal relationships from the top of the list.Equitability is difficult to formalize for associations in general but has a clear interpretation inthe basic case of functional relationships: An equitable statistic should give similar scores to functional relationships with similar R2 values (givensufficient sample size).Here, we describe an exploratory data analysis tool, the maximal information coefficient(MIC), that satisfies these two heuristic properties. We establish MIC’s generality through proofs,show its equitability on functional relationshipsthrough simulations, and observe that this translates into intuitively equitable behavior on moregeneral associations. Furthermore, we illustratethat MIC gives rise to a larger family of statistics, which we refer to as MINE, or maximalinformation-based nonparametric exploration.MINE statistics can be used not only to identifyinteresting associations, but also to characterizethem according to properties such as nonlinearity and monotonicity. We demonstrate theapplication of MIC and MINE to data sets inhealth, baseball, genomics, and the humanmicrobiota.The maximal information coefficient. Intuitively, MIC is based on the idea that if a relationship exists between two variables, then agrid can be drawn on the scatterplot of the twovariables that partitions the data to encapsulatethat relationship. Thus, to calculate the MIC of aset of two-variable data, we explore all grids upto a maximal grid resolution, dependent on thesample size (Fig. 1A), computing for every pair2x23Imagine a data set with hundreds of variables,which may contain important, undiscoveredrelationships. There are tens of thousands ofvariable pairs—far too many to examine manually. If you do not already know what kinds ofrelationships to search for, how do you efficientlyidentify the important ones? Data sets of this sizeare increasingly common in fields as varied asgenomics, physics, political science, and economics, making this question an important and growing challenge (1, 2).One way to begin exploring a large data setis to search for pairs of variables that are closelyassociated. To do this, we could calculate somemeasure of dependence for each pair, rank thepairs by their scores, and examine the top-scoringpairs. For this strategy to work, the statistic weuse to measure dependence should have two heuristic properties: generality and equitability.By generality, we mean that with sufficientsample size the statistic should capture a widerange of interesting associations, not limited tospecific function types (such as linear, exponential,or periodic), or even to all functional relationships (3). The latter condition is desirable becauseARowsIdentifying interesting relationships between pairs of variables in large data sets is increasinglyimportant. Here, we present a measure of dependence for two-variable relationships: the maximalinformation coefficient (MIC). MIC captures a wide range of associations both functional andnot, and for functional relationships provides a score that roughly equals the coefficient ofdetermination (R2) of the data relative to the regression function. MIC belongs to a largerclass of maximal information-based nonparametric exploration (MINE) statistics for identifyingand classifying relationships. We apply MIC and MINE to data sets in global health, geneexpression, major-league baseball, and the human gut microbiota and identify known andnovel relationships.David N. Reshef,1,2,3*† Yakir A. Reshef,2,4*† Hilary K. Finucane,5 Sharon R. Grossman,2,6Gilean McVean,3,7 Peter J. Turnbaugh,6 Eric S. Lander,2,8,9Michael Mitzenmacher,10‡ Pardis C. Sabeti2,6‡Normalized ScoreDetecting Novel Associationsin Large Data Setsof integers (x,y) the largest possible mutual information achievable by any x-by-y grid appliedto the data. We then normalize these mutualinformation values to ensure a fair comparisonbetween grids of different dimensions and to obtain modified values between 0 and 1. We define the characteristic matrix M (mx,y), wheremx,y is the highest normalized mutual information achieved by any x-by-y grid, and thestatistic MIC to be the maximum value in M(Fig. 1, B and C).More formally, for a grid G, let IG denotethe mutual information of the probability dis-

RESEARCH ARTICLES(x,y)-th entry mx,y of the characteristic matrixequals max{IG}/log min{x,y}, where the maximum is taken over all x-by-y grids G. MIC is 1-0.110.020.060.380.76(non-Fourier frequency)(varying frequency)GMaximal Information Coefficient (MIC)0.80Relationship Type0.650.500.35CDTwo LinesLine and ParabolaXEllipseSinusoid(Mixture of two signals)ENon-coexistenceFig. 2. Comparison of MIC to existing methods (A) Scores given to variousnoiseless functional relationships by several different statistics (8, 12, 14, 19).Maximal scores in each column are accentuated. (B to F) The MIC, Spearmancorrelation coefficient, mutual information (Kraskov et al. estimator), maximalcorrelation (estimated using ACE), and the principal curve-based CorGC dependence measure, respectively, of 27 different functional relationships withindependent uniform vertical noise added, as the R2 value of the data relative tothe noiseless function varies. Each shape and color corresponds to a differentcombination of function type and sample size. In each plot, pairs of thumbnailsshow relationships that received identical scores; for data exploration, we would Flike these pairs to have similar noise levels. For a list of the functions and samplesizes in these graphs as well as versions with other statistics, sample sizes, andnoise models, see figs. S3 and S4. (G) Performance of MIC on associations notwell modeled by a function, as noise level varies. For the performance of otherstatistics, see figs. S5 and .81**0.60.40.20Added Noisewww.sciencemag.org0.81Mutual Information (Kraskov et al.)(Fourier frequency)**1Maximal Information Coefficient(MIC)Pearson SpearmanMaximalCorrelation00.25.02.521.5* **10.500*0.20.40.61Maximal Correlation (ACE)MICCorGC0.8**0.60.40.2000.20.4*0.60.8*11CorGC (Principal Curve-Based)Relationship TypeMutual InformationSquared Spearman RankCorrelation CoefficientAmaximum of mx,y over ordered pairs (x,y) suchthat xy B, where B is a function of sample size;we usually set B n0.6 (see SOM Section 2.2.1).Downloaded from www.sciencemag.org on December 16, 2011tribution induced on the boxes of G, where theprobability of a box is proportional to the number of data points falling inside the box. The*0.8*0.60.4**0.200VOL 3340.20.40.6Noise (1 - R2)0.816 DECEMBER 201111519

Every entry of M falls between 0 and 1, andso MIC does as well. MIC is also symmetric [i.e.,MIC(X, Y ) MIC(Y, X )] due to the symmetry ofmutual information, and because IG dependsonly on the rank order of the data, MIC is invariant under order-preserving transformations ofthe axes. Notably, although mutual informationis used to quantify the performance of each grid,MIC is not an estimate of mutual information(SOM Section 2).To calculate M, we would ideally optimizeover all possible grids. For computational efficiency, we instead use a dynamic programmingalgorithm that optimizes over a subset of the possible grids and appears to approximate well thetrue value of MIC in practice (SOM Section 3).Main properties of MIC. We have provenmathematically that MIC is general in the sensedescribed above. Our proofs show that, with probability approaching 1 as sample size grows, (i)MIC assigns scores that tend to 1 to all neverconstant noiseless functional relationships; (ii)MIC assigns scores that tend to 1 for a largerclass of noiseless relationships (including superpositions of noiseless functional relationships);and (iii) MIC assigns scores that tend to 0 tostatistically independent variables.Specifically, we have proven that for a pair ofrandom variables X and Y, (i) if Y is a function ofX that is not constant on any open interval, thendata drawn from (X,Y ) will receive an MIC tending to 1 with probability one as sample size grows;(ii) if the support of (X,Y ) is described by afinite union of differentiable curves of the formc(t) [x(t),y(t)] for t in [0,1], then data drawn from(X,Y ) will receive an MIC tending to 1 withprobability one as sample size grows, providedthat dx/dt and dy/dt are each zero on finitelymany points; (iii) the MIC of data drawn from(X,Y ) converges to zero in probability as samplesize grows if and only if X and Y are statistically independent. We have also proven thatthe MIC of a noisy functional relationship isbounded from below by a function of its R2.(For proofs, see SOM.)We tested MIC’s equitability through simulations. These simulations confirm the mathematical result that noiseless functional relationships(i.e., R2 1.0) receive MIC scores approaching1.0 (Fig. 2A). They also show that, for a largecollection of test functions with varied samplesizes, noise levels, and noise models, MIC roughly equals the coefficient of determination R2 relative to each respective noiseless function. ThisAB1.0ins300Ver 10tical Ax10is B20ins300nsBixislAtaon00.5300.0200Ver1010tical Axis B 20insBinsnsBixis200alANormalized ScoreBin0.0Horizont20is B30ontalAVer 10tical Ax10xis2000.5riz0.0s30301.01.00.5insFHoNormalized Score1.0Normalized ScorensBiEis Briz020Ho30l AxxisDinslAxisis BticaontalA01010Ver300riz300Hoins20Normalized Scoreis B0.0taon20l Axriztica10Ver 10tical Ho0.50.0C1.0Normalized ScoreNormalized Score1.0makes it easy to interpret and compare scoresacross various function types (Fig. 2B and fig.S4). For instance, at reasonable sample sizes, asinusoidal relationship with a noise level of R2 0.80 and a linear relationship with the same R2value receive nearly the same MIC score. For awide range of associations that are not wellmodeled by a function, we also show that MICscores degrade in an intuitive manner as noiseis added (Fig. 2G and figs. S5 and S6).Comparisons to other methods. We comparedMIC to a wide range of methods—including methods formulated around the axiomatic frameworkfor measures of dependence developed by Rényi(8), other state-of-the-art measures of dependence,and several nonparametric curve estimation techniques that can be used to score pairs of variables based on how well they fit the estimatedcurve.Methods such as splines (1) and regressionestimators (1, 9, 10) tend to be equitable acrossfunctional relationships (11) but are not general: They fail to find many simple and importanttypes of relationships that are not functional.(Figures S5 and S6 depict examples of relationships of this type from existing literature, andcompare these methods to MIC on such relation-A B C D EFig. 3. Visualizations of the characteristic matrices of common relationships. (A to F) Surfaces representing the characteristic matrices of severalcommon relationship types. For each surface, the x axis represents number of vertical axis bins (rows), the y axis represents number of horizontal152016 DECEMBER 2011VOL 334axis bins (columns), and the z axis represents the normalized score of thebest-performing grid with those dimensions. The inset plots show the relationships used to generate each surface. For surfaces of additional relationships, see fig. S7.SCIENCEwww.sciencemag.orgDownloaded from www.sciencemag.org on December 16, 2011RESEARCH ARTICLES

RESEARCH ARTICLESprincipal curve–based methods (16–19, 20), distance correlation (21), and the Spearman rankcorrelation coefficient all detect broader classesof relationships. However, they are not equitableeven in the basic case of functional relationships: They show a strong preference for sometypes of functions, even at identical noise Mutual Information (Kraskov et al.)Under Five Mortality Rate400020000020,00040,000Gross Nat’l Inc / Person (Int )6020104070100Rural Access to Improved Water Sources (%)2501250Fig. 4. Application of MINE to global indicators from the WHO. (A) MICversus r for all pairwise relationships in the WHO data set. (B) Mutualinformation (Kraskov et al. estimator) versus r for the same relationships.High mutual information scores tend to be assigned only to relationshipswith high r, whereas MIC gives high scores also to relationships that arenonlinear. (C to H) Example relationships from (A). (C) Both r and MIC yieldlow scores for unassociated variables. (D) Ordinary linear relationships scorehigh under both tests. (E to G) Relationships detected by MIC but not by r,because the relationships are nonlinear (E and G) or because more than onerelationship is present (F). In (F), the linear trendline comprises a set ofwww.sciencemag.org2x1061006000Life Expectancy (Years)3dUnder Five Mortality Rate6Mutual Information 0.65-0.5300Improved Water Facilities (%)20002x105MIC 0.6510001x1056000Health Exp. / Person (US )Health Exp. / Person (Int )1512900200ge-13030000Number of PhysiciansHMIC 0.85 0.83Mutual Information f006G 60Income / Person (Int )0.54800Children Per Woman25I1c21650hPearson Correlation Coefficient (ρ)1275MIC ScoreB8Health Exp. / Person (Int )E60.254Dentist Density (per 10,000)151290603000G-110F0-0.5201600Measles Imm. Disparity (%)C30E90Downloaded from www.sciencemag.org on December 16, 20110.5Adult (Female) Obesity (%)Pearson Correlation Coefficient (ρ)HD40(Fig. 2, A and C to F). For example, at a samplesize of 250, the Kraskov et al. mutual information estimator (14) assigns a score of 3.65 to anoiseless line but only 0.59 to a noiseless sinusoid, and it gives equivalent scores to a verynoisy line (R2 0.35) and to a much cleanersinusoid (R2 0.80) (Fig. 2D). Again, theseDeaths due to HIV/AIDSC1Life Lost to Injuries (% yrs)ALife Expectancy (Years)ships.) Although these methods are not intendedto provide generality, the failure to assign highscores in such cases makes them unsuitable foridentifying all potentially interesting relationshipsin a data set.Other methods such as mutual informationestimators (12–14), maximal correlation (8, 15),SCIENCE8060400200040006000Health Exp. / Person (US )100500900Cardiovascular Disease Mortality (per 1E5)Pacific island nations in which obesity is culturally valued (33); most othercountries follow a parabolic trend (table S10). (H) A superposition of tworelationships that scores high under all three tests, presumably because themajority of points obey one relationship. The less steep minority trendconsists of countries whose economies rely largely on oil (37) (table S11).The lines of best fit in (D) to (H) were generated using regression on eachtrend. (I) Of these four relationships, the left two appear less noisy than theright two. MIC accordingly assigns higher scores to the two relationships onthe left. In contrast, mutual information assigns similar scores to the top tworelationships and similar scores to the bottom two relationships.VOL 33416 DECEMBER 20111521

1522sure MIC – r2, we found several interesting relationships (Fig. 4, E to G), many of which areconfirmed by existing literature (30–32). For example, we identified a superposition of two functional associations between female obesity andincome per person—one from the Pacific Islands,where female obesity is a sign of status (33), andone from the rest of the world, where weightand status do not appear to be linked in this way(Fig. 4F).We next explored a yeast gene expressiondata set (6223 genes) that was previously analyzed with a special-purpose statistic developedby Spellman et al. to identify genes whosetranscript levels oscillate during the cell cycle(26). Of the genes identified by Spellman et al.and

of MIT and Harvard, Cambridge, MA 02142, USA. 3Department of Statistics, University of Oxford, Oxford OX1 3TG, UK. 4De-partment of Mathematics, Harv ard College, Cambridge, MA 02138, USA. 5Department of Computer Science and Applied Mathematics, Weizmann InstituteofScience,Rehovot,Israel. 6Center for Systems Biology, Department of Organismic and

Related Documents:

Detecting Novel Associations in Large Astrophysical Data Sets Elizabeth Mart nez-G omez Mercedes T. Richardsy Donald St. P. Richardsz Abstract The distance correlation as a measure of dependence between collections of random vari-ables was introduced by Sz ekely,

SIMON, N., and TIBSHIRANI R. "comment on “detecting novel associations in large data sets” by Reshef et al, Science dec 16, 2011." Science (2011). Reshef et al (2011) Detecting Novel Associations in Large Data Sets

As well as detecting associations within known QTL regions, a number of novel associations were detected; the most notable of these was a region of chromosome 13 associated with milk yield in the population of Holstein-Friesian sires. Conclusions: A total of 276 of novel SNPs were dete

data sets, first for detection and classification of anomalies in . if higher-order associations are leveraged by the data mining algorithm: i.e., associations between records. For example, . detecting novel e

Download novel the alchemist terjemahan. The alchemist (novel) plot. What books to read after the alchemist. The alchemist graphic novel free download. The alchemist novel in urdu pdf download. The alchemist novel price. The alchemist (novel) summary.

ActivityConnection.com - National Novel Writing Month: Write Your Own Novel - Page 1 of 28 Read Write National Novel Writing Month: Write Your Own Novel In November 1999, a challenge was made: write a 50,000-word novel in just 30 days. Writer Chris Baty posed this challenge to the 21 people in his writing group. The next year, another

employers bought health insurance through associations, called Association Health Plans (AHPs). Associations, such as professional or trade associations, were often created to further common economic or policy interests and incidentally offered AHPs as a benefit to their members. Other associations, however, were

of just what a stream-of-consciousness novel really is. Numerous labels have been offered in efforts to define this unique approach: the "thought-stream novel" or 2 3 simply the "stream novel;" the "time novel;" the Shiv K. Kumar, Bergson and the Stream of Conscious ness Novel (New York, 1963), p. 2. Paul West, The Modern Novel (London, 1965 .