A Tree-based Framework for Difference Summarization

Ruoming Jin, Yuri Breitbart, Rong Li
Department of Computer Science, Kent State University, Kent, OH 44242, USA
Email: {jin,yuri,rli0}@cs.kent.edu

Abstract—Understanding the differences between two datasets is a fundamental data mining question and is also ubiquitously important across many real-world scientific applications. In this paper, we propose a tree-based framework that provides a parsimonious explanation of the difference between two distributions based on a rigorous two-sample statistical test. We develop two efficient approaches. The first is a dynamic programming approach that finds a minimal number of data subsets describing the difference between the two datasets. The second is a greedy approach that approximates the dynamic programming approach. We employ the well-known Friedman's MST (minimal spanning tree) statistic for two-sample statistical tests in our summarization tree construction, and develop novel techniques to speed up its computational procedure. We performed a detailed experimental evaluation on both real and synthetic datasets and demonstrated the effectiveness of our tree-summarization approach.

Keywords—difference summarization; minimal spanning tree; two-sample test

I. INTRODUCTION

Understanding the differences between two datasets is a fundamental data mining problem, and it is also ubiquitously important across many real-world applications and scientific disciplines. In a pharmaceutical study, doctors would like to learn patients' responses to two competing or related drugs. Specifically, if the responses are statistically different, doctors need to understand the differences with respect to a list of factors, such as blood pressure, age, etc. Frequently, they may find that in most of the parameter ranges or conditions, the responses are quite similar. Thus, the key problem is: how can they find a concise description that summarizes the differences and/or similarities between the responses to these two drugs? Similar problems also appear frequently in the business domain. For instance, the overall sales of December 2008 may be quite different from the overall sales of December 2007. However, the important problem is to learn where customers actually spend their money, i.e., we need to understand how customers spend their money on different products at two different time points.

Despite the importance of this problem, little research in data mining has been done to derive a summarization framework that concisely describes the differences between two datasets. Contrast mining is a closely related effort [1], [2], [3], which tries to discover the significant differences between two datasets. It can discover some local differences between two distributions, but it does not try to provide a parsimonious explanation for the difference between the two distributions. Another line of work in the spirit of this problem provides a parsimonious explanation of change for OLAP applications [4], [5]. Given a hierarchical framework, they try to summarize the difference of aggregations. However, their work is quite limited, since they focus on aggregation where the hierarchical structure is already given, and their approach cannot be applied to the generalized distribution setting.

In this paper, we provide a summarization framework to describe differences between two datasets, each of which consists of points in a multidimensional space. Intuitively, suppose that we were given two sets of points, one consisting of black points and the other of red points, distributed in a multidimensional space.
Contrary to being well separated, these two sets of points seem intertwined with one another. Suppose that, despite their similarity, using a well-known two-sample test from statistical data analysis [6] we cannot accept the hypothesis that these two datasets are generated from the same distribution. (For convenience, we say they are statistically different.) Thus, the research problem here is: how can we concisely summarize or explain the differences between these two sets of points?

We propose here a tree-based framework to provide a parsimonious explanation of the differences between two datasets. At a high level, the tree-summarization framework shares a certain similarity with the well-known decision-tree scheme. Basically, starting from the root, each internal node of the tree corresponds to a cut on a specific dimension, which recursively splits the multidimensional space into two parts. However, in contrast to the decision-tree scheme, our goal is not to separate the two sets of data points. Instead, the tree gives a parsimonious description of where the distributions differ. Specifically,

1. We study the problem of constructing a concise summarization of the differences between two distributions/datasets and formulate a tree-summarization framework based on a rigorous two-sample statistical test.

2. We develop two efficient approaches to construct the summarization tree. The first is a dynamic programming approach; the other is a greedy approach.

3. In our summarization tree construction, we employ the well-known Friedman's MST (minimal spanning tree) statistic for two-sample statistical tests [7], and develop novel techniques to speed up the computation procedure.

4. We perform a detailed experimental evaluation on both real and synthetic datasets and demonstrate the effectiveness and efficiency of our tree-summarization approach.

II. PRELIMINARY: MULTIVARIATE TWO-SAMPLE TEST

In this section, we first describe the two-sample test [6], which forms the basis of our problem formulation and of our approach to its solution.

A. Two-Sample Test

Consider two sets of data points D1 = {X1, X2, · · · , Xn} and D2 = {Y1, Y2, · · · , Ym} in d-dimensional space, where n and m are the sizes of datasets D1 and D2, respectively. We are interested in determining whether they are likely to be generated from the same underlying distribution. This question is addressed by the classical two-sample test. Let Xi and Yj be independent samples from unknown distributions F(x) and G(x), respectively. The two-sample test considers two hypotheses: a null hypothesis H0: F(x) = G(x) against the alternative hypothesis H1: F(x) ≠ G(x).

A good test statistic is expected to satisfy the distribution-independence and consistency requirements [8], [9], [10], [11]. An exactly distribution-independent test requires that, under the null hypothesis, as min(m, n) → ∞ the test statistic does not depend on the unknown distribution F(x) (it is asymptotically distribution free) and its limiting distribution is known. The test also needs to be consistent against the alternative H1, i.e., when min(m, n) → ∞, the probability of rejecting the null hypothesis converges to one. In other words, when F(x) ≠ G(x), we will surely reject the null hypothesis as the number of samples approaches infinity. In addition, we note that when we reject the null hypothesis using any statistical test, we in fact prove that the distributions are different, and thus we can say the two datasets are statistically different. However, in statistics one can never prove the distributions to be the same, even with a large amount of data. Thus, in this paper, when we are not able to reject the null hypothesis, we simply say the datasets are "statistically consistent."

Many two-sample tests have been developed over the last several decades [6], [12], [11]. The majority of them are one-dimensional two-sample tests, and generalizing them to multidimensional settings is not trivial. Also, the well-known chi-square test applies to binned distributions [8]; however, it is not a consistent test in a continuous space, i.e., the chi-square test may fail to distinguish two datasets that indeed come from different distributions even as the sample size approaches infinity.

B. Friedman-Rafsky Multivariate Two-Sample Test

Several methods [7], [9], [10], [13] have been developed for the two-sample test in multidimensional space, or simply the multivariate two-sample test. Among them, the Friedman-Rafsky test [7] is one of the most well-known and computationally efficient multivariate two-sample tests. The basic idea of this approach is to first construct a minimum spanning tree (MST) over all the data points in D1 ∪ D2 in the multidimensional Euclidean space, and then to remove all the edges in the MST whose two adjacent nodes (data points) come from different datasets, i.e., one from D1 and the other from D2. The MST thus becomes a forest with many disjoint subtrees, and all the subtrees (also referred to as runs) in the forest contain only nodes of the same type (data points from the same dataset).
Finally, the test statistic R_{m,n} is the total number of those disjoint subtrees in the resulting forest.

Figure 1. (a) The global MST and (b) the runs of the MST.

These concepts are illustrated in Figure 1. Figure 1(a) shows the global MST of two samples of 8 points (each point within a sample carries the same label), while Figure 1(b) shows the result after removing the edges linking nodes from different samples. Note that a run is a disjoint subtree in the forest whose data points all come from the same dataset. Intuitively, a rejection of H0 corresponds to the case where we have a relatively small number of subtrees (runs); in other words, the two sets of points are relatively well separated.

Formally, it has been proven that for large sample sizes (min(m, n) → ∞ and m/(m + n) → p ≠ 0), under H0 the distribution of

    W(R_{m,n}) = [ R_{m,n} − (1 + 2mn/(m + n)) ] / ( (m + n)^{1/2} σ_d )        (1)

approaches the standard normal distribution, where σ_d^2 depends on p and on Var(D_d), the variance of the degree of a vertex in the MST, which can be easily approximated [14]. Given this, for a significance level α, we can reject the null hypothesis if W ≤ −z_α, where P(x ≤ −z_α) = α and z_α is the critical value of the standard normal distribution. Using this procedure, we can expect that all the internal nodes are statistically different and that most of the leaf nodes are either statistically consistent (we cannot reject the null hypothesis) or have min(m, n) < t, where t is a pre-defined threshold.

Friedman-Rafsky's MST statistic test is a clever extension of the univariate Wald-Wolfowitz test [11]: just imagine the tree being projected onto a one-dimensional space so that it becomes a line, and a run is simply a set of consecutive points all from the same dataset.
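
To make the test procedure concrete, the following is a minimal sketch of the Friedman-Rafsky run count R_{m,n} built from off-the-shelf MST and connected-component routines, assuming SciPy and NumPy are available. The names fr_run_count and fr_test are illustrative, and the permutation-based calibration in fr_test is a stand-in assumption for the asymptotic normal statistic W of Eq. (1), whose σ_d term is not reproduced here.

```python
# A minimal sketch (not the paper's implementation) of the Friedman-Rafsky
# run count R_{m,n} from Section II-B, assuming SciPy/NumPy are available.
import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def fr_run_count(D1, D2):
    """Build the pooled MST, drop edges joining points from different samples,
    and return the number of remaining subtrees (runs).  Assumes no duplicate
    points: SciPy treats zero entries of a dense matrix as missing edges."""
    X = np.vstack([D1, D2])
    labels = np.r_[np.zeros(len(D1), dtype=int), np.ones(len(D2), dtype=int)]
    mst = minimum_spanning_tree(distance_matrix(X, X)).tocoo()
    keep = labels[mst.row] == labels[mst.col]          # same-sample edges only
    forest = csr_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                        shape=(len(X), len(X)))
    n_runs, _ = connected_components(forest, directed=False)
    return n_runs

def fr_test(D1, D2, n_perm=200, seed=0):
    """Approximate one-sided p-value for H0 via a permutation null.  This
    calibration is an assumption made for the sketch; the paper uses the
    asymptotic normal statistic W of Eq. (1) instead."""
    rng = np.random.default_rng(seed)
    X, n = np.vstack([D1, D2]), len(D1)
    r_obs = fr_run_count(D1, D2)
    null = [fr_run_count(X[p[:n]], X[p[n:]])
            for p in (rng.permutation(len(X)) for _ in range(n_perm))]
    # few runs => the samples are well separated => evidence against H0
    return r_obs, float(np.mean(np.array(null) <= r_obs))
```

Any other distribution-free, consistent multivariate two-sample test could be plugged in at the same place.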

It has been proven that Friedman-Rafsky's MST statistical test is distribution-independent and consistent [14], [7]. It is also a rather computationally efficient test procedure, since several methods can construct the minimal spanning tree in Euclidean space in almost O(N log N) computational complexity, where N = m + n is the total number of data points [15], [16], [17].

We note that the multivariate two-sample test is the basis of our summarization framework for describing the statistical difference between two datasets. As we will see, our tree-based summarization approach can in general utilize any available multivariate two-sample test, but most of them do not have the nice properties of being distribution-independent, consistent, and computationally efficient that the Friedman-Rafsky MST statistic test has. Thus, in this work, we focus on the Friedman-Rafsky MST statistic test to describe statistical differences between two datasets. Since our summarization tree repeatedly invokes the test procedure, we also develop novel methods to further speed up its computation (Section V).

III. THE SUMMARIZATION FRAMEWORK: A TREE-BASED SCHEME

In this section, we introduce our tree-based framework for providing a parsimonious description of the differences between two datasets.

Summarization Tree: A summarization tree is, at a high level, a binary tree that recursively partitions the multidimensional space into smaller regions. Each node of the tree corresponds to a region of the multidimensional space, together with all the data points of both datasets D1 and D2 that belong to that region. The root corresponds to the entire multidimensional space (R^d). The region associated with each node is described recursively through a cut associated with each internal node. Specifically, the cut of internal node v_i is denoted [A_i : x_i], where A_i is the cut attribute and x_i is the cut value. This cut splits the region, and all the data points in that region, into two parts, each belonging to one of the internal node's two children: the left child has all the data points of its parent whose A_i dimension is less than or equal to x_i (A_i ≤ x_i), and the right child has those data points whose A_i dimension is greater than x_i (A_i > x_i). Finally, the leaf nodes do not have any cut.

We use this tree structure to describe the differences between any two datasets D1 and D2. The tree has the following properties:

1) Internal Nodes: Let D1[v_i] and D2[v_i] be the sets of data points belonging to the region described by internal node v_i. For any internal node v_i, we can reject the null hypothesis that D1[v_i] and D2[v_i] are generated from the same distribution. In other words, the two sets of data points from D1 and D2 associated with any internal node are statistically different.

2) Leaf Nodes: The sets of data points associated with a leaf node v_i, D1[v_i] and D2[v_i], are either statistically consistent or too small to be further partitioned. In the latter case, a lower bound on the minimal number of data points in each set is defined to keep the statistical test meaningful.

At each node v_i, we need to run a statistical test to see whether its associated sets of data points, D1[v_i] and D2[v_i], statistically come from the same distribution or not. In other words, we have a null hypothesis F_{D1[v_i]}(x) = G_{D2[v_i]}(x) against the alternative F_{D1[v_i]}(x) ≠ G_{D2[v_i]}(x).
We apply the aforementioned Friedman-Rafsky MST two-sample test of Subsection II-B for this purpose.

3) Cut: The cut associated with each internal node v_i describes a summarized view of the associated datasets, D1[v_i] and D2[v_i]. Specifically, for each internal node v_i, its cut [A_i : x_i] splits its associated data, D1[v_i] and D2[v_i], into two parts: let v_j be its left child and v_k be its right child. Then we have D1[v_j] and D2[v_j] for the left child and D1[v_k] and D2[v_k] for the right child. To form the summarized view of the left child and the right child, we regard the cut as creating two bins for the associated data, D1[v_i] and D2[v_i]. In other words, we obtain a 2 × 2 contingency table, where each row corresponds to a dataset and the columns correspond to the bins (Table I).

Table I. Statistical test for the summarized view

        Left child (v_j): A_i ≤ x_i     Right child (v_k): A_i > x_i
  D1    |D1[v_j]|                       |D1[v_k]|
  D2    |D2[v_j]|                       |D2[v_k]|

We can use the chi-square test to determine whether the quantities in the first row and the quantities in the second row come from the same distribution [8]. If we can reject that they come from the same distribution, we say the cut is a dependent cut; otherwise, we say the cut is an independent cut.
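
As a concrete illustration of this dependent/independent classification, the short sketch below runs a standard chi-square test on the 2 × 2 table of point counts from Table I; the function name classify_cut and the significance level alpha = 0.05 are illustrative assumptions, not choices fixed by the paper.

```python
# Classify a cut as dependent or independent (Table I) with a chi-square test
# on the 2x2 table of point counts; classify_cut and alpha are illustrative.
from scipy.stats import chi2_contingency

def classify_cut(d1_left, d1_right, d2_left, d2_right, alpha=0.05):
    table = [[d1_left, d1_right],    # counts of D1 in the two bins
             [d2_left, d2_right]]    # counts of D2 in the two bins
    _, p_value, _, _ = chi2_contingency(table)
    # rejecting independence: the split proportions differ between datasets
    return "dependent" if p_value < alpha else "independent"

# D1 splits 80/20 across the cut while D2 splits 40/60: a dependent cut
print(classify_cut(80, 20, 40, 60))
```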

The main intuition and motivation behind this framework is the observation that two related datasets/distributions often tend to be similar in most parts of the space. They differ either because there is a shift of the data distribution from one part of the space to another, or because there is a hot spot or region in the space; those are the events likely to produce the differences between two datasets. For instance, in the business example, the sales of this December may be close to the sales of last December except that customers in total spend less money on luxury products, such as DVDs or games, while their purchase distribution is still similar once such a high-level difference is excluded. Indeed, the decision-tree construction allows us to focus on local regions, and when we perform the two-sample tests, the global effect of the difference is isolated and does not affect the two-sample tests on the local regions. Specifically, the two-sample test on D1[v_i] and D2[v_i] is totally independent of the rest of the data points, i.e., of D1\D1[v_i] and D2\D2[v_i].

We also note that in traditional statistical analysis, the global distribution difference is typically captured through a mean shift or scaling. Here, we assume that both distributions are properly normalized, and thus we do not need to handle such differences explicitly. Besides, in this framework we do not consider the multiple comparison/inference problem [6], since we treat each two-sample test independently and we are not interested in deriving an overall statistical significance for the entire summarization tree. However, if this is preferred, i.e., if we would like to treat all the two-sample tests in a summarization tree as a whole, we can utilize methods such as the Bonferroni correction to adjust the significance level of each test.

Minimal Summarization Tree Problem: Given this, we introduce the minimal difference summarization tree (MDST) problem. Let T be a difference summarization tree. We assign each node in the tree a cost. The main idea is that if we cannot reject that the two distributions are the same (statistically consistent), we do not need to output a description of the two distributions; in the other cases, we need to output certain information to describe their differences. This information is associated with each node in the tree, and the cost of each node reflects the minimal description length of that information.

For each internal node v_i, if the cut is dependent, the cost of the node is cost(v_i) = 6; if the cut is independent, the cost of the node is cost(v_i) = 2. This is because for an independent cut we only need to record two values, the cut point and the dimension, while for a dependent cut we need to record six values: the cut point, the dimension, and the summarized distribution for the chi-square test. For each leaf node v_i, if we cannot reject that its associated datasets D1[v_i] and D2[v_i] come from the same distribution, its cost is cost(v_i) = 0; otherwise, cost(v_i) equals the number of points in the leaf node. However, some cases need to be considered carefully. One is that if the number of points from one dataset in the leaf node is too small (say, fewer than 5), we assign the leaf node cost(v_i) = the number of points in the leaf node. Another is that if the number of points from one dataset is only a very small proportion of the number of points from the other dataset (e.g., 5%), then applying the statistical test is also meaningless; in this case we assign the leaf node cost(v_i) = the number of points from the smaller dataset in the leaf node. Given this, the overall description cost of a tree T is simply the sum of the costs of its nodes, Cost(T) = Σ_{v_i ∈ T} cost(v_i). (The cost of each node can be modified according to different applications.)

An example of calculating the cost and constructing the tree is given in the experimental results section; see Figure 3 and Figure 4.

Definition 1 (Minimal Difference Summarization Tree (MDST) Problem): Given two datasets D1 and D2, the most parsimonious description of the differences between D1 and D2 is the difference summarization tree T that has the minimal cost, i.e., the minimal difference summarization tree:

    arg min_T Cost(T)
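
The node-cost bookkeeping above is easy to express in code. The following is a minimal sketch of cost(v_i) and Cost(T) under the convention just stated (2 for an independent cut, 6 for a dependent cut, 0 for a consistent leaf, point counts otherwise); the Node class, its field names, and the thresholds min_points and min_ratio are assumptions made for illustration.

```python
# A minimal sketch of the MDST node costs and Cost(T) from Section III.
# The Node class, its fields, and the thresholds are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    n1: int                                  # |D1[v_i]|: points of D1 in the region
    n2: int                                  # |D2[v_i]|: points of D2 in the region
    cut_dependent: Optional[bool] = None     # None for a leaf node
    consistent: bool = False                 # leaf only: two-sample test not rejected
    children: List["Node"] = field(default_factory=list)

def node_cost(v: Node, min_points: int = 5, min_ratio: float = 0.05) -> int:
    if v.children:                           # internal node: cost of recording the cut
        return 6 if v.cut_dependent else 2
    if min(v.n1, v.n2) < min_points:         # too few points to run the test
        return v.n1 + v.n2
    if min(v.n1, v.n2) < min_ratio * max(v.n1, v.n2):   # severely unbalanced leaf
        return min(v.n1, v.n2)
    return 0 if v.consistent else v.n1 + v.n2

def tree_cost(root: Node) -> int:            # Cost(T): sum of node costs
    return node_cost(root) + sum(tree_cost(c) for c in root.children)
```
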
IV. ALGORITHMS FOR TREE CONSTRUCTION

In this section, we introduce two approaches for finding the minimal difference summarization tree (MDST) of two datasets D1 and D2. The first approach is a dynamic programming approach, which is guaranteed to find the MDST but is computationally prohibitive. The second approach is a greedy approach.

In both the dynamic programming approach and the greedy approach, we prune a subtree if its total cost is higher than the cost of its root node (without the subtree), and we replace the subtree with a leaf node to indicate that the corresponding region is statistically different.

Finally, we note that both approaches need to invoke the Friedman-Rafsky two-sample test. Since the test repeatedly constructs a minimal spanning tree on each specific region, this becomes very expensive. In the next section, we introduce two novel techniques that can significantly reduce this cost.

A. Dynamic Programming for MDST

Differing from classical decision-tree construction [18], we first show that the MDST problem can be solved by dynamic programming in polynomial time. We then introduce heuristics that reduce the computational cost of the dynamic programming through discretization.

To facilitate our discussion, we introduce the following notation. Let [b_1, e_1] : [b_2, e_2] : · · · : [b_d, e_d] be a d-dimensional cube in R^d, and let D1[[b_1, e_1] : · · · : [b_d, e_d]] and D2[[b_1, e_1] : · · · : [b_d, e_d]] be the sets of points from D1 and D2, respectively, that fall in this region. Clearly, each node in the tree is associated with such a cube and its corresponding datasets. The possible cutting points are simply the midpoints between any two consecutive values on each dimension. Specifically, for dimension i, we project all the data points in D1[[b_1, e_1] : · · · : [b_d, e_d]] and D2[[b_1, e_1] : · · · : [b_d, e_d]] onto that dimension; suppose the unique projected values are y_1 < y_2 < · · · < y_t. Then the cut points are (y_i + y_{i+1})/2, for 1 ≤ i ≤ t − 1. Further, let T([b_1, e_1] : · · · : [b_d, e_d]) be a difference summarization tree on the cube [b_1, e_1] : · · · : [b_d, e_d], and let y_{i,j} denote the j-th cut point in the i-th dimension. Then the basic observation behind the MDST is the following recursive formulation:

    min_T Cost(T([b_1, e_1] : · · · : [b_d, e_d]))
        = min_{i,j} [ Cut(y_{i,j} | [b_1, e_1] : · · · : [b_d, e_d])
            + min_T Cost(T([b_1, e_1] : · · · : [b_i, y_{i,j}] : · · · : [b_d, e_d]))
            + min_T Cost(T([b_1, e_1] : · · · : [y_{i,j}, e_i] : · · · : [b_d, e_d])) ]        (2)

Here, Cut(y_{i,j} | [b_1, e_1] : · · · : [b_d, e_d]) is the cost of the summarized difference of the two datasets under this cut (Section III): using the chi-square test, if the cut is dependent the cut cost is 6, and otherwise the cost is 2. Finally, we note that for the root node, b_i and e_i are simply the smallest and largest values in the i-th dimension.

Algorithm 1 Dynamic Programming for MDST
Procedure MinTreeCost(C), where C = [b_1, e_1] : · · · : [b_d, e_d]
Require: Datasets D1 and D2
 1: if Find_Cube(C) {computed before} then
 2:     return Cost(C)
 3: end if
 4: if (|D1[C]| / |D2[C]| < b) or (|D2[C]| / |D1[C]| < b) {b is the lower bound for the statistical tests} then
 5:     Cost(C) = min(|D1[C]|, |D2[C]|) {treat as dependent cut}
 6:     return Cost(C)
 7: end if
 8: if min(|D1[C]|, |D2[C]|) < t {t is the lower bound for the statistical tests} then
 9:     Cost(C) = |D1[C]| + |D2[C]| {treat as dependent cut}
10:     return Cost(C)
11: end if
12: if NOT Test-Diff(D1[C], D2[C]) {statistically consistent under the two-sample test} then
13:     Cost(C) = 0
14:     return 0
15: end if
16: MinCost = |D1[C]| + |D2[C]|
17: for all dimensions i, 1 ≤ i ≤ d do
18:     for all cut points y_{i,j} in dimension i do
19:         L = [b_1, e_1] : · · · : [b_i, y_{i,j}] : · · · : [b_d, e_d]
20:         R = [b_1, e_1] : · · · : [y_{i,j}, e_i] : · · · : [b_d, e_d]
21:         Cut = χ²(|D1[L]|, |D1[R]|; |D2[L]|, |D2[R]|)
22:         MinCost = min(MinCost, Cut + MinTreeCost(L) + MinTreeCost(R))
23:     end for
24: end for
25: Cost(C) = MinCost
26: return MinCost

The dynamic programming procedure is shown in Algorithm 1. For a given cube, the algorithm first checks whether its difference summarization tree has been computed before (Line 1) or whether there is not enough data for the two-sample statistical test (Lines 4 and 8); in these three cases, we directly return its cost. We then perform a two-sample test on the two datasets in the cube (Line 12). If we cannot reject that they come from the same distribution, we assign this cube cost 0 (a possible leaf node) and return the cost immediately. Otherwise, the two datasets are statistically different; we first assume that the minimal cost of this cube is the total number of points in the cube (Line 16), which is used for pruning the subtree, and we then try all possible cuts on every dimension (Lines 17 and 18), applying the aforementioned recursive formula to find the minimal cost of the difference summarization tree on this cube (Lines 21 and 22). It is easy to see that this approach is guaranteed to find the minimal difference summarization tree for datasets D1 and D2. Its worst-case computational cost is O(N^{3d} + N^{2d} · N^2), where N^{3d} is the cost of searching over all possible subproblems and N^{2d} · N^2 is the total cost of invoking the two-sample test, which in the worst case is O(N^2) since we use the Friedman-Rafsky MST test (N = |D1| + |D2| = m + n). The algorithm also needs O(N^{2d}) space to memorize the intermediate computations (all cubes). When N and d are large, this procedure is clearly computationally prohibitive.

To make it more computationally feasible, we can avoid examining all possible cut points and instead look only at candidates that are more likely to produce a small difference summarization tree. In this work, we employ a discretization preprocessing step that produces a fixed number (K) of intervals in each dimension over the entire datasets, and we then consider only the discretization points as possible cutting points. This reduces the computational cost to O(K^{3d} + K^{2d} · N^2). In addition, we note that many discretization methods have been developed for both supervised and unsupervised learning; here we apply the Chi-merge procedure [19], which merges statistically independent adjacent intervals, with each dataset used as a class label in the discretization step.
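
To make the recursion above concrete, here is a compact sketch of Algorithm 1's memoized search written against the costs of Section III. It is a toy rendering under stated assumptions, not the paper's implementation: the names min_tree_cost, chi2_cut_cost, and test_diff, the Laplace smoothing in the chi-square table, and the default thresholds t and b are illustrative, and only small inputs are practical given the exponential number of cubes.

```python
# A compact sketch of Algorithm 1: memoized DP over cube boundaries.
# min_tree_cost / chi2_cut_cost / test_diff, the +1 smoothing, and the
# default thresholds are illustrative assumptions for this sketch.
import numpy as np
from scipy.stats import chi2_contingency

def chi2_cut_cost(n1_left, n1_right, n2_left, n2_right, alpha=0.05):
    """Cut cost: 6 for a dependent cut, 2 for an independent cut (Section III)."""
    table = np.array([[n1_left, n1_right], [n2_left, n2_right]]) + 1  # smoothing
    _, p, _, _ = chi2_contingency(table)
    return 6 if p < alpha else 2

def min_tree_cost(D1, D2, cube, test_diff, t=10, b=0.05, memo=None):
    """cube is a tuple of (low, high) pairs, one per dimension, used as memo key.
    Cut points are midpoints of consecutive data values, so no point sits on a cut."""
    memo = {} if memo is None else memo
    if cube in memo:                                          # computed before (Line 1)
        return memo[cube]
    lo = np.array([l for l, _ in cube]); hi = np.array([h for _, h in cube])
    P1 = D1[np.all((D1 >= lo) & (D1 <= hi), axis=1)]
    P2 = D2[np.all((D2 >= lo) & (D2 <= hi), axis=1)]
    n1, n2 = len(P1), len(P2)
    if n1 == 0 or n2 == 0 or min(n1, n2) < b * max(n1, n2):   # unbalanced (Line 4)
        memo[cube] = min(n1, n2)
    elif min(n1, n2) < t:                                     # too few points (Line 8)
        memo[cube] = n1 + n2
    elif not test_diff(P1, P2):                               # statistically consistent (Line 12)
        memo[cube] = 0
    else:
        best = n1 + n2                                        # pruning bound (Line 16)
        for i in range(D1.shape[1]):                          # all dimensions (Line 17)
            vals = np.unique(np.concatenate([P1[:, i], P2[:, i]]))
            for y in (vals[:-1] + vals[1:]) / 2.0:            # midpoint cuts (Line 18)
                left = cube[:i] + ((cube[i][0], y),) + cube[i + 1:]
                right = cube[:i] + ((y, cube[i][1]),) + cube[i + 1:]
                cut = chi2_cut_cost(int(np.sum(P1[:, i] <= y)), int(np.sum(P1[:, i] > y)),
                                    int(np.sum(P2[:, i] <= y)), int(np.sum(P2[:, i] > y)))
                best = min(best, cut
                           + min_tree_cost(D1, D2, left, test_diff, t, b, memo)
                           + min_tree_cost(D1, D2, right, test_diff, t, b, memo))
        memo[cube] = best
    return memo[cube]
```

The test_diff callback can be any multivariate two-sample test, e.g., a thresholded p-value from the fr_test sketch of Section II-B.
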
B. Greedy Construction Algorithm

Clearly, even with the discretization, the dynamic programming can be prohibitive for high-dimensional datasets (large d). Our second algorithm therefore employs a greedy construction procedure similar to classical decision-tree classifier construction. At each step, starting from the root node, it greedily finds an optimal cut according to a certain criterion, splits both datasets into two parts, and constructs two children based on the cut. It then recursively splits each child until either we cannot reject that the associated datasets are generated from the same distribution, or there are not enough sample data points for the two-sample test.

The key problem, however, is how to find the optimal cut for a given node and its associated datasets. This problem is clearly different from classical decision-tree construction, where one tries to split the data into well-separated classes and typically uses the information entropy or the Gini index to determine the best cut. Here, the two datasets are most likely inseparable from one another, and our goal is to provide a concise summarization of their differences. Thus, a key insight is that at each node we should try to split the corresponding datasets into two parts such that, within each part, the two datasets are more likely to be statistically consistent. In other words, if the two datasets are statistically different globally on the cube, we find where they differ most and then select that point as the splitting point.

Our intuition can be formalized with the two-sample Kolmogorov-Smirnov (K-S) test statistic, which is one of the most widely used tests for the one-dimensional two-sample problem [6]. The K-S test statistic is defined on top of the empirical distribution functions of datasets D1 and D2. For any real value t,

    F_n(t) = |{X_i : X_i ≤ t}| / n    and    G_m(t) = |{Y_i : Y_i ≤ t}| / m.

The functions F_n(t) and G_m(t) are the empirical marginal distributions of the two sets D1 and D2, respectively, along any given dimension. Given this, the K-S statistic is defined as the maximum difference between the empirical distributions:

    D = max_t |F_n(t) − G_m(t)|.

The K-S test is distribution-free, consistent, and, more importantly, invariant under any local sliding or stretching as long as the ranks of the data points along the specific dimension are unchanged.

Given this, our greedy algorithm computes the K-S statistic in each dimension and chooses the cut that yields the maximal K-S statistic. Thus, we find the largest difference among the marginal distributions, and the cut helps to reduce this difference. The overall greedy algorithm is shown in Algorithm 2.

Algorithm 2 Greedy Construction for MDST
Procedure GreedyConstruction(C), where C = [b_1, e_1] : · · · : [b_d, e_d]
Require: Datasets D1 and D2
 1: if (|D1[C]| / |D2[C]| < b) or (|D2[C]| / |D1[C]| < b) {b is the lower bound for the statistical tests} then
 2:     construct a leaf node; return
 3: end if
 4: if min(|D1[C]|, |D2[C]|) < t {t is the lower bound for the statistical tests} then
 5:     construct a leaf node; return
 6: end if
 7: if NOT Test-Diff(D1[C], D2[C]) {statistically consistent under the two-sample test} then
 8:     construct a leaf node; return
 9: end if
10: MaxD = 0
11: for all dimensions i, 1 ≤ i ≤ d do
12:     MaxD = max(MaxD, K-S(D1[C], D2[C], [b_i, e_i]))
13: end for
14: y_{i,j} = the cut achieving MaxD
15: {recursively construct the difference summarization tree}
16: GreedyConstruction([b_1, e_1] : · · · : [b_i, y_{i,j}] : · · · : [b_d, e_d])
17: GreedyConstruction([b_1, e_1] : · · · : [y_{i,j}, e_i] : · · · : [b_d, e_d])

The worst-case time complexity of this algorithm is
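
To make the cut-selection step of Algorithm 2 concrete, the sketch below scans every dimension, evaluates the empirical-CDF gap |F_n(t) − G_m(t)| at the pooled sample values, and returns the dimension and value where the gap is largest; the name best_ks_cut and the toy data are illustrative assumptions, and the full greedy construction would then recurse on the two resulting halves.

```python
# A minimal sketch of the greedy cut selection: pick, over all dimensions,
# the cut value achieving the largest two-sample K-S statistic on the
# marginal distributions.  best_ks_cut and the toy example are illustrative.
import numpy as np

def best_ks_cut(D1, D2):
    """Return (dimension, cut_value, ks_statistic) with the largest K-S gap."""
    best = (None, None, -1.0)
    for i in range(D1.shape[1]):
        x, y = np.sort(D1[:, i]), np.sort(D2[:, i])
        grid = np.r_[x, y]                                    # candidate evaluation points
        F = np.searchsorted(x, grid, side="right") / len(x)   # F_n(t)
        G = np.searchsorted(y, grid, side="right") / len(y)   # G_m(t)
        j = int(np.argmax(np.abs(F - G)))
        if abs(F[j] - G[j]) > best[2]:
            best = (i, float(grid[j]), float(abs(F[j] - G[j])))
    return best

# toy example: the two samples differ only in dimension 0
rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(300, 2))
B = np.c_[rng.normal(1.5, 1.0, 300), rng.normal(0.0, 1.0, 300)]
print(best_ks_cut(A, B))   # expect dimension 0, cut roughly midway between the means
```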

