Lecture Notes For Chapter 7 Introduction To Data Mining


Data Mining
Cluster Analysis: Basic Concepts and Algorithms
Lecture Notes for Chapter 7, Introduction to Data Mining, 2nd Edition
by Tan, Steinbach, Karpatne, Kumar

What is Cluster Analysis?
Given a set of objects, place them in groups such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Intra-cluster distances are minimized, while inter-cluster distances are maximized.

Applications of Cluster Analysis

Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.
– [Example table: discovered stock clusters grouped by industry, e.g., Compaq-DOWN, EMC-Corp-DOWN, Morgan-Stanley-DOWN; Schlumberger-UP, Oil-UP]

Summarization
– Reduce the size of large data sets.
– [Figure: clustering precipitation in Australia]

Notion of a Cluster can be Ambiguous
– How many clusters? The same set of points can reasonably be grouped into two clusters, four clusters, or six clusters.

Types of Clusterings
– A clustering is a set of clusters.
– Important distinction between hierarchical and partitional sets of clusters:
– Partitional clustering: a division of data objects into non-overlapping subsets (clusters).
– Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.

Partitional Clustering
[Figure: original points and a partitional clustering of them]

Hierarchical Clustering
[Figure: traditional hierarchical clustering of points p1-p4 with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram]

Other Distinctions Between Sets of Clusters
Exclusive versus non-exclusive
– In non-exclusive clusterings, points may belong to multiple clusters; such points can belong to multiple classes or could be 'border' points.
– Fuzzy clustering (one type of non-exclusive): a point belongs to every cluster with some weight between 0 and 1, and the weights must sum to 1. Probabilistic clustering has similar characteristics.
Partial versus complete
– In some cases, we only want to cluster some of the data.

Types of Clusters
– Well-separated clusters
– Prototype-based clusters
– Contiguity-based clusters
– Density-based clusters
– Clusters described by an objective function

Types of Clusters: Well-Separated
– A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
– [Figure: 3 well-separated clusters]

Types of Clusters: Prototype-Based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the prototype or "center" of its cluster than to the center of any other cluster.
– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster.
– [Figure: 4 center-based clusters]

Types of Clusters: Contiguity-Based
– Contiguous cluster (nearest neighbor or transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
– [Figure: 8 contiguous clusters]

Types of Clusters: Density-Based
– A cluster is a dense region of points, separated by low-density regions from other regions of high density.
– Used when the clusters are irregular or intertwined, and when noise and outliers are present.
– [Figure: 6 density-based clusters]

Types of Clusters: Objective Function
Clusters defined by an objective function
– Find clusters that minimize or maximize an objective function.
– Enumerate all possible ways of dividing the points into clusters and evaluate the goodness of each potential set of clusters using the given objective function (NP-hard).
– Can have global or local objectives: hierarchical clustering algorithms typically have local objectives, while partitional algorithms typically have global objectives.
– A variation of the global objective function approach is to fit the data to a parameterized model. Parameters for the model are determined from the data; mixture models, for example, assume that the data is a 'mixture' of a number of statistical distributions.

Characteristics of the Input Data Are Important
Type of proximity or density measure
– Central to clustering; depends on data and application.
Data characteristics that affect proximity and/or density:
– Dimensionality and sparseness
– Attribute type
– Special relationships in the data (for example, autocorrelation)
– Distribution of the data
Noise and outliers
– Often interfere with the operation of the clustering algorithm.
Clusters of differing sizes, densities, and shapes

Clustering Algorithms
– K-means and its variants
– Hierarchical clustering
– Density-based clustering

K-means Clustering
– Partitional clustering approach.
– Number of clusters, K, must be specified.
– Each cluster is associated with a centroid (center point).
– Each point is assigned to the cluster with the closest centroid.
– The basic algorithm is very simple.

Example of K-means Clustering
[Figure: K-means result on a 2-D data set at iteration 6]

Example of K-means Clustering
[Figure: K-means assignments over iterations 1 through 6]

K-means Clustering – Details
Simple iterative algorithm:
– Choose initial centroids;
– repeat {assign each point to its nearest centroid; re-compute cluster centroids}
– until centroids stop changing.
Initial centroids are often chosen randomly.
– Clusters produced can vary from one run to another.
The centroid is (typically) the mean of the points in the cluster, but other definitions are possible (see Table 7.2).
K-means will converge for common proximity measures with an appropriately defined centroid (see Table 7.2).
Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to 'until relatively few points change clusters'.
Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes. A minimal implementation sketch follows.
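The following is a minimal NumPy sketch of the basic loop described above; the function name and defaults (kmeans, max_iter, seed) are illustrative choices, not from the textbook.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means: assign points to the nearest centroid, recompute centroids."""
    rng = np.random.default_rng(seed)
    # Initial centroids chosen randomly from the data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to the cluster with the closest centroid (Euclidean).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids stopped changing
            break
        centroids = new_centroids
    return labels, centroids
```

Each iteration costs roughly O(n * K * d), which matches the O(n * K * I * d) total quoted above.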

K-means Objective Function
A common objective function (used with the Euclidean distance measure) is the Sum of Squared Error (SSE).
– For each point, the error is the distance to the nearest cluster center.
– To get SSE, we square these errors and sum them:

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)

– x is a data point in cluster C_i and m_i is the centroid (mean) of cluster C_i.
– SSE improves in each iteration of K-means until it reaches a local or global minimum. A small helper for computing SSE is sketched below.

Two different K-means Clusterings
[Figure: the same original points with an optimal clustering and a sub-optimal clustering]
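A small helper matching the SSE definition above; it assumes the labels and centroids produced by the kmeans sketch earlier (hypothetical names).

```python
import numpy as np

def sse(points, labels, centroids):
    # Sum over all points of the squared Euclidean distance to the
    # centroid of the cluster the point is assigned to.
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())
```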

Importance of Choosing Initial Centroids
[Figures: two runs of K-means from different initial centroids, shown over iterations 1 through 5]

Importance of Choosing Initial Centroids
– Depending on the choice of initial centroids, clusters B and C may get merged or remain separate.

Problems with Selecting Initial Points
– If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small, and relatively small when K is large.
– If the clusters are all the same size, n, then the probability is

P = \frac{K! \, n^K}{(Kn)^K} = \frac{K!}{K^K}

– For example, if K = 10, then the probability is 10!/10^10 = 0.00036 (a quick check of this arithmetic is shown below).
– Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't.
– Consider an example of five pairs of clusters.
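The factor n^K cancels against part of (Kn)^K, so the probability reduces to K!/K^K; a two-line check of the K = 10 case:

```python
from math import factorial

K = 10
prob = factorial(K) / K**K   # K! * n**K / (K*n)**K; the n terms cancel
print(prob)                  # 0.00036288, i.e. roughly 0.00036 as quoted above
```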

10 Clusters Example
[Figures: iterations 1-4 of K-means, starting with two initial centroids in one cluster of each pair of clusters]

10 Clusters Example
[Figures: iterations of K-means, starting with some pairs of clusters having three initial centroids, while others have only one]

Solutions to Initial Centroids Problem
Multiple runs
– Helps, but the probability is not on your side.
Use some strategy to select the k initial centroids and then select among these initial centroids
– Select the most widely separated (K-means++ is a robust way of doing this selection).
– Use hierarchical clustering to determine initial centroids.
Bisecting K-means
– Not as susceptible to initialization issues.

K-means++
This approach can be slower than random initialization, but it very consistently produces better results in terms of SSE.
– To select a set of initial centroids, C, perform the following:
  1. Select an initial point at random to be the first centroid.
  2. For k - 1 steps:
  3.   For each of the N points, x_i, 1 <= i <= N, find the minimum squared distance to the currently selected centroids C_1, ..., C_j, 1 <= j < k, i.e., min_j d^2(C_j, x_i).
  4.   Randomly select a new centroid by choosing a point with probability proportional to min_j d^2(C_j, x_i) / sum_i min_j d^2(C_j, x_i).
  5. End For
– The k-means++ algorithm guarantees an approximation ratio of O(log k) in expectation, where k is the number of centers. A sketch of this initialization follows.
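A NumPy sketch of the K-means++ selection procedure above; names are illustrative. After initialization, the regular K-means iterations run as usual.

```python
import numpy as np

def kmeans_pp_init(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick the first centroid uniformly at random.
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # Minimum squared distance from each point to the centroids chosen so far.
        d2 = np.min(
            np.linalg.norm(points[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1)
        # Choose the next centroid with probability proportional to d^2.
        centroids.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centroids)
```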

Bisecting K-means
Bisecting K-means algorithm
– A variant of K-means that can produce a partitional or a hierarchical clustering.
– Repeatedly splits (bisects) a chosen cluster with basic K-means until K clusters are obtained (see the sketch below).
– See the CLUTO clustering toolkit.

Bisecting K-means Example
[Figure: steps of bisecting K-means on a sample data set]
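A simplified sketch of bisecting K-means, reusing the kmeans and sse helpers sketched earlier; choosing the cluster with the largest SSE to split is one common choice, and the trial count is illustrative.

```python
import numpy as np

def cluster_sse(pts):
    # SSE of a single cluster about its own centroid.
    return float(((pts - pts.mean(axis=0)) ** 2).sum())

def bisecting_kmeans(points, K, n_trials=5):
    clusters = [np.arange(len(points))]           # start with one all-inclusive cluster
    while len(clusters) < K:
        # Choose a cluster to split: here, the one with the largest SSE.
        splittable = [i for i, c in enumerate(clusters) if len(c) > 1]
        worst = max(splittable, key=lambda i: cluster_sse(points[clusters[i]]))
        idx = clusters.pop(worst)
        best_labels, best_total = None, np.inf
        for t in range(n_trials):                 # several trial bisections with basic K-means
            labels, cents = kmeans(points[idx], 2, seed=t)
            total = sse(points[idx], labels, cents)
            if total < best_total:
                best_total, best_labels = total, labels
        clusters.append(idx[best_labels == 0])    # keep the best bisection
        clusters.append(idx[best_labels == 1])
    return clusters                               # list of index arrays, one per cluster
```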

Limitations of K-means
K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-globular shapes
K-means has problems when the data contains outliers.
– One possible solution is to remove outliers before clustering.

Limitations of K-means: Differing Sizes
[Figure: original points vs. the K-means result with 3 clusters]

Limitations of K-means: Differing Density
[Figure: original points vs. the K-means result with 3 clusters]

Limitations of K-means: Non-globular Shapes
[Figure: original points vs. the K-means result with 2 clusters]

Overcoming K-means Limitations
[Figures: original points vs. K-means clusters for the differing-sizes and differing-density cases]
One solution is to find a large number of clusters such that each of them represents a part of a natural cluster. These small clusters then need to be put together in a post-processing step.

Overcoming K-means Limitations
[Figure: original points vs. K-means clusters for the non-globular case]
As before, one solution is to find a large number of clusters, each representing part of a natural cluster, and combine the small clusters in a post-processing step.

Hierarchical Clustering
– Produces a set of nested clusters organized as a hierarchical tree.
– Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.

Strengths of Hierarchical Clustering
– Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.
– The clusters may correspond to meaningful taxonomies, for example in the biological sciences (animal kingdom, phylogeny reconstruction, ...).

Hierarchical Clustering
Two main types of hierarchical clustering
– Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
– Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains an individual point (or there are k clusters).
Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time.

Agglomerative Clustering Algorithm
Key idea: successively merge the closest clusters.
Basic algorithm:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat
4.   Merge the two closest clusters.
5.   Update the proximity matrix.
6. Until only a single cluster remains.
The key operation is the computation of the proximity of two clusters.
– Different approaches to defining the distance between clusters distinguish the different algorithms (a short library-based illustration follows).

Steps 1 and 2
Start with clusters of individual points and a proximity matrix.
[Figure: points p1-p5 and their proximity matrix]
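For illustration, SciPy provides an agglomerative implementation; the method argument selects the inter-cluster proximity definition covered on the following slides (single = MIN, complete = MAX, average = group average, ward). The toy data set below is made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((20, 2))     # toy 2-D data set
Z = linkage(X, method='single')                  # merge history (defines the dendrogram)
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
```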

Intermediate Situation
After some merging steps, we have some clusters.
[Figure: clusters C1-C5 and the corresponding proximity matrix]

Step 4
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: proximity matrix before merging C2 and C5]

Step 5
The question is "How do we update the proximity matrix?"
[Figure: proximity matrix with the merged cluster C2 U C5; its entries are marked "?"]

How to Define Inter-Cluster Distance
– MIN
– MAX
– Group Average
– Distance Between Centroids
– Other methods driven by an objective function (Ward's Method uses squared error)

How to Define Inter-Cluster Similarity
[Figures: the same proximity-matrix illustration repeated, highlighting in turn MIN, MAX, Group Average, and Distance Between Centroids; other methods are driven by an objective function, e.g., Ward's Method uses squared error]

MIN or Single Link
– Proximity of two clusters is based on the two closest points in the different clusters.
– Determined by one pair of points, i.e., by one link in the proximity graph.
[Example distance matrix]

Hierarchical Clustering: MIN
[Figure: nested clusters and the corresponding single-link dendrogram for points 1-6]

Strength of MIN
[Figure: original points and the six clusters found]
– Can handle non-elliptical shapes.

Limitations of MIN
[Figure: original points and the two- and three-cluster results]
– Sensitive to noise.

MAX or Complete Linkage
– Proximity of two clusters is based on the two most distant points in the different clusters.
– Determined by all pairs of points in the two clusters.
[Example distance matrix]

Hierarchical Clustering: MAX
[Figure: nested clusters and the corresponding complete-link dendrogram for points 1-6]

Strength of MAX
[Figure: original points and the two clusters found]
– Less susceptible to noise.

Limitations of MAX
[Figure: original points and the two clusters found]
– Tends to break large clusters.
– Biased towards globular clusters.

Group Average
Proximity of two clusters is the average of the pairwise proximities between points in the two clusters:

proximity(Cluster_i, Cluster_j) = \frac{\sum_{p_i \in Cluster_i, \, p_j \in Cluster_j} proximity(p_i, p_j)}{|Cluster_i| \times |Cluster_j|}

A direct implementation of this formula is sketched below.
[Example distance matrix]

Hierarchical Clustering: Group Average
[Figure: nested clusters and the corresponding group-average dendrogram for points 1-6]
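A direct translation of the group-average formula above, assuming Euclidean distance as the point-to-point proximity; function and argument names are illustrative.

```python
import numpy as np

def group_average_distance(cluster_a, cluster_b):
    # Average pairwise Euclidean distance between the two clusters' points.
    d = np.linalg.norm(cluster_a[:, None, :] - cluster_b[None, :, :], axis=2)
    return d.sum() / (len(cluster_a) * len(cluster_b))
```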

Hierarchical Clustering: Group Average
– A compromise between Single and Complete Link.
– Strengths: less susceptible to noise.
– Limitations: biased towards globular clusters.

Cluster Similarity: Ward's Method
– Similarity of two clusters is based on the increase in squared error when the two clusters are merged.
– Similar to group average if the distance between points is distance squared.
– Less susceptible to noise.
– Biased towards globular clusters.
– Hierarchical analogue of K-means: can be used to initialize K-means.

Hierarchical Clustering: Comparison
[Figure: clusterings of the same six points under MIN, MAX, Group Average, and Ward's Method]

Hierarchical Clustering: Time and Space Requirements
– O(N^2) space, since it uses the proximity matrix (N is the number of points).
– O(N^3) time in many cases: there are N steps, and at each step the N^2-sized proximity matrix must be updated and searched.
– Complexity can be reduced to O(N^2 log N) time with some cleverness.

Hierarchical Clustering: Problems and Limitations
– Once a decision is made to combine two clusters, it cannot be undone.
– No global objective function is directly minimized.
– Different schemes have problems with one or more of the following: sensitivity to noise; difficulty handling clusters of different sizes and non-globular shapes; breaking large clusters.

Density-Based Clustering
– Clusters are regions of high density that are separated from one another by regions of low density.
[Figure: example of density-based clusters]

DBSCAN
DBSCAN is a density-based algorithm.
– Density = number of points within a specified radius (Eps).
– A point is a core point if it has at least a specified number of points (MinPts) within Eps. These are points in the interior of a cluster; the count includes the point itself.
– A border point is not a core point, but is in the neighborhood of a core point.
– A noise point is any point that is not a core point or a border point.

DBSCAN: Core, Border, and Noise Points
[Figure: illustration of core, border, and noise points for MinPts = 7]

DBSCAN: Core, Border and Noise Points
[Figure: original points and point types (core, border, noise) for Eps = 10, MinPts = 4]

DBSCAN Algorithm
Form clusters using core points, and assign each border point to one of its neighboring clusters.
1. Label all points as core, border, or noise points.
2. Eliminate noise points.
3. Put an edge between all core points within a distance Eps of each other.
4. Make each group of connected core points into a separate cluster.
5. Assign each border point to one of the clusters of its associated core points.
A compact sketch of these steps follows.
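A compact sketch of steps 1-5 above using a brute-force distance matrix (fine for small data; real implementations use spatial indexes). Names, and the choice of which nearby core point a border point is attached to, are illustrative.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = dist <= eps                       # Eps-neighborhoods (point counts itself)
    core = neighbors.sum(axis=1) >= min_pts       # step 1: core points
    labels = np.full(n, -1)                       # -1 = noise / not yet assigned
    cluster = 0
    for i in range(n):                            # steps 3-4: connect nearby core points
        if core[i] and labels[i] == -1:
            labels[i], stack = cluster, [i]
            while stack:
                j = stack.pop()
                for k in np.where(neighbors[j] & core & (labels == -1))[0]:
                    labels[k] = cluster
                    stack.append(k)
            cluster += 1
    for i in range(n):                            # step 5: attach border points
        if not core[i]:
            near_core = np.where(neighbors[i] & core)[0]
            if len(near_core):
                labels[i] = labels[near_core[0]]  # one of its associated core points' clusters
    return labels                                 # points still labeled -1 are noise (step 2)
```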

When DBSCAN Works Well
[Figure: original points and the resulting clusters; dark blue points indicate noise]
– Can handle clusters of different shapes and sizes.
– Resistant to noise.

When DBSCAN Does NOT Work Well
[Figure: original points]

When DBSCAN Does NOT Work Well
[Figure: results for (MinPts = 4, Eps = 9.92) and (MinPts = 4, Eps = 9.75)]
– Varying densities.
– High-dimensional data.

DBSCAN: Determining Eps and MinPts
– The idea is that for points in a cluster, their k-th nearest neighbors are at a close distance.
– Noise points have their k-th nearest neighbor at a farther distance.
– So, plot the sorted distance of every point to its k-th nearest neighbor and look for a knee in the curve (see the sketch below).
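A short sketch of the sorted k-dist plot just described; with MinPts = k, a knee in this curve is a common heuristic choice for Eps. Matplotlib and the brute-force distance computation are assumptions for the illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def k_distance_plot(points, k=4):
    # Distance of every point to its k-th nearest neighbor, plotted in sorted order.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    kth = np.sort(dist, axis=1)[:, k]     # column 0 is the point itself (distance 0)
    plt.plot(np.sort(kth))
    plt.xlabel("points sorted by k-dist")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.show()
```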

Cluster Validity
– For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.
– For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters.
– But "clusters are in the eye of the beholder"! In practice, the clusters we find are defined by the clustering algorithm.
– Then why do we want to evaluate them?
  – To avoid finding patterns in noise
  – To compare clustering algorithms
  – To compare two sets of clusters
  – To compare two clusters

Clusters Found in Random Data
[Figure: random points and the clusters found in that random data by different clustering algorithms]

Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following two types.
– Supervised: used to measure the extent to which cluster labels match externally supplied class labels (e.g., entropy). Often called external indices because they use information external to the data.
– Unsupervised: used to measure the goodness of a clustering structure without respect to external information (e.g., Sum of Squared Error, SSE). Often called internal indices because they only use information in the data.
You can use supervised or unsupervised measures to compare clusters or clusterings.

Unsupervised Measures: Cohesion and Separation
– Cluster cohesion: measures how closely related the objects in a cluster are (example: SSE).
– Cluster separation: measures how distinct or well-separated a cluster is from other clusters.
Example: squared error
– Cohesion is measured by the within-cluster sum of squares (SSE):

SSE = \sum_i \sum_{x \in C_i} (x - m_i)^2

– Separation is measured by the between-cluster sum of squares (SSB):

SSB = \sum_i |C_i| \, (m - m_i)^2

where |C_i| is the size of cluster i, m_i is the centroid of cluster i, and m is the overall mean of the data.

Unsupervised Measures: Cohesion and Separation
Example (four points 1, 2, 4, 5 on a line; overall mean m = 3): SSB + SSE = constant.

K = 1 cluster:
  SSE = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10
  SSB = 4 * (3 - 3)^2 = 0
  Total = 10 + 0 = 10

K = 2 clusters ({1, 2} with centroid m_1 = 1.5 and {4, 5} with centroid m_2 = 4.5):
  SSE = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1
  SSB = 2 * (3 - 1.5)^2 + 2 * (4.5 - 3)^2 = 9
  Total = 1 + 9 = 10

The same numbers can be checked with the short script below.

Unsupervised Measures: Cohesion and Separation
A proximity graph-based approach can also be used for cohesion and separation.
– Cluster cohesion is the sum of the weights of all links within a cluster.
– Cluster separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster.
[Figure: illustration of cohesion (edges within a cluster) and separation (edges between clusters)]
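The numbers in the example can be verified in a few lines (the data set 1, 2, 4, 5 is the one used above):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()                                             # overall mean = 3

# K = 1: a single cluster containing all four points.
sse1 = ((x - m) ** 2).sum()                              # 10
ssb1 = len(x) * (m - m) ** 2                             # 0

# K = 2: clusters {1, 2} and {4, 5}.
m1, m2 = x[:2].mean(), x[2:].mean()                      # 1.5 and 4.5
sse2 = ((x[:2] - m1) ** 2).sum() + ((x[2:] - m2) ** 2).sum()   # 1
ssb2 = 2 * (m - m1) ** 2 + 2 * (m - m2) ** 2                   # 9

print(sse1 + ssb1, sse2 + ssb2)                          # both totals equal 10
```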

Unsupervised Measures: Silhouette Coefficient
– The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as for clusters and clusterings.
– For an individual point i:
  – Calculate a = average distance of i to the points in its own cluster.
  – Calculate b = min (average distance of i to the points in another cluster).
  – The silhouette coefficient for the point is then s = (b - a) / max(a, b).
  – The value can vary between -1 and 1, but typically ranges between 0 and 1; the closer to 1, the better.
– [Figure: distances used to calculate a (within the point's cluster) and b (to another cluster)]
– Can calculate the average silhouette coefficient for a cluster or a clustering (see the example below).
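An illustration using scikit-learn's silhouette functions on made-up data; silhouette_samples gives the per-point values s = (b - a)/max(a, b) and silhouette_score their average.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),    # two well-separated blobs
               rng.normal(3, 0.3, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)

per_point = silhouette_samples(X, labels)      # s_i for every point
overall = silhouette_score(X, labels)          # average silhouette coefficient
```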

Measuring Cluster Validity Via Correlation
Two matrices
– Proximity matrix.
– Ideal similarity matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 if the pair belongs to different clusters.
Compute the correlation between the two matrices.
– Since the matrices are symmetric, only the correlation between n(n-1)/2 entries needs to be calculated.
– A high magnitude of correlation indicates that points belonging to the same cluster are close to each other.
– Correlation may be positive or negative depending on whether the proximity matrix holds similarities or dissimilarities.
– Not a good measure for some density- or contiguity-based clusters.

Example: correlation of the ideal similarity and proximity matrices for the K-means clustering of a well-clustered data set gives Corr = 0.9235. A small sketch of this computation follows.
[Figure: the well-clustered points and the corresponding similarity matrix]
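A sketch of the correlation computation: build the ideal similarity matrix from the cluster labels, take the n(n-1)/2 entries above the diagonal of both matrices, and correlate them. Function and variable names are illustrative; because distances (a dissimilarity) are used here, a good clustering yields a strongly negative value, while a similarity matrix flips the sign.

```python
import numpy as np

def validity_correlation(points, labels):
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    ideal = (labels[:, None] == labels[None, :]).astype(float)  # 1 if same cluster, else 0
    iu = np.triu_indices(n, k=1)              # the n(n-1)/2 entries above the diagonal
    # Distances are dissimilarities, so good clusterings give a strong
    # negative correlation here; with a similarity matrix the sign flips.
    return np.corrcoef(dist[iu], ideal[iu])[0, 1]
```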
