Data Mining Taylor Statistics 202: Data Mining


Statistics 202: Data Mining
Outliers
Based in part on slides from textbook, slides of Susan Holmes
© Jonathan Taylor
December 2, 2012

Concepts

- What is an outlier? The set of data points that are considerably different than the remainder of the data ...
- When do they appear in data mining tasks?
  - Given a data matrix $X$, find all the cases $x_i \in X$ with anomaly/outlier scores greater than some threshold $t$. Or, the top $n$ outlier scores.
  - Given a data matrix $X$, containing mostly normal (but unlabeled) data points, and a test case $x_{new}$, compute an anomaly/outlier score of $x_{new}$ with respect to $X$.
- Applications
  - Credit card fraud detection;
  - Network intrusion detection;
  - Misspecification of a model.

What is an outlier? (Figure slide.)

Issues

- How many outliers are there in the data?
- Method is unsupervised, similar to clustering or finding clusters with only 1 point in them.
- Usual assumption: There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data.

General steps

- Build a profile of the “normal” behavior. The profile generally consists of summary statistics of this “normal” population.
- Use these summary statistics to detect anomalies, i.e. points whose characteristics are very far from the normal profile.
- General types of schemes involve a statistical model of “normal”, and “far” is measured in terms of likelihood.
- Other schemes based on distances can be quasi-motivated by such statistical techniques ...

Statistical approach

- Assume a parametric model describing the distribution of the data (e.g., normal distribution).
- Apply a statistical test that depends on:
  - Data distribution (e.g., normal)
  - Parameter of distribution (e.g., mean, variance)
  - Number of expected outliers (confidence limit, $\alpha$ or Type I error)

Grubbs’ Test

- Suppose we have a sample of $n$ numbers $Z = \{Z_1, \dots, Z_n\}$, i.e. an $n \times 1$ data matrix.
- Assuming the data is from a normal distribution, Grubbs’ test uses the distribution of
$$\frac{\max_{1 \le i \le n} Z_i - \bar{Z}}{SD(Z)}$$
to search for outlying large values.

Grubbs’ Test

- Lower tail variant:
$$\frac{\min_{1 \le i \le n} Z_i - \bar{Z}}{SD(Z)}$$
- Two-sided variant:
$$\frac{\max_{1 \le i \le n} |Z_i - \bar{Z}|}{SD(Z)}$$
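The three Grubbs-type statistics can be computed directly from a sample. The following is a minimal Python sketch using only the standard library; the function name `grubbs_statistics` is illustrative, and reading SD(Z) as the sample standard deviation (n - 1 denominator) is an assumption, since the slides do not say which version is meant.

```python
import statistics

def grubbs_statistics(z):
    """Return (upper, lower, two_sided) Grubbs-type statistics for a sample z."""
    z_bar = statistics.mean(z)
    sd = statistics.stdev(z)  # assumption: sample SD with n - 1 denominator
    upper = (max(z) - z_bar) / sd                        # searches for large values
    lower = (min(z) - z_bar) / sd                        # searches for small values
    two_sided = max(abs(v - z_bar) for v in z) / sd      # searches in both tails
    return upper, lower, two_sided

sample = [1.0, 2.0, 3.0, 4.0, 10.0]
upper, lower, two_sided = grubbs_statistics(sample)
print(upper, lower, two_sided)
```

For this sample the value 10.0 drives both the upper and the two-sided statistic, while the lower-tail statistic is negative and small in magnitude.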

Grubbs’ Test

- Having chosen a test statistic, we must determine a threshold for our decision rule.
- Often this is set via a hypothesis test to control Type I error.
- For a large positive outlier, the threshold is based on choosing some acceptable Type I error $\alpha$ and finding $c_\alpha$ so that
$$P_0\left( \frac{\max_{1 \le i \le n} Z_i - \bar{Z}}{SD(Z)} \ge c_\alpha \right) \approx \alpha$$
- Above, $P_0$ denotes the distribution of $Z$ under the assumption there are no outliers.
- If $Z$ are IID $N(\mu, \sigma^2)$ it is generally possible to compute a decent approximation of this probability using Bonferroni.
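The Bonferroni step mentioned above can be spelled out as a union bound. This is a sketch of the standard argument rather than a derivation taken from the slides; it assumes only that, under $P_0$, the $n$ standardized deviations are exchangeable, so each has the same marginal tail probability.

```latex
\begin{aligned}
P_0\left( \frac{\max_{1 \le i \le n} Z_i - \bar{Z}}{SD(Z)} \ge c \right)
  &= P_0\left( \bigcup_{i=1}^{n} \left\{ \frac{Z_i - \bar{Z}}{SD(Z)} \ge c \right\} \right) \\
  &\le \sum_{i=1}^{n} P_0\left( \frac{Z_i - \bar{Z}}{SD(Z)} \ge c \right)
   = n \, P_0\left( \frac{Z_1 - \bar{Z}}{SD(Z)} \ge c \right).
\end{aligned}
```

Choosing $c$ so that the single-observation tail probability equals $\alpha / n$ therefore keeps the Type I error at or below $\alpha$.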

Grubbs’ Test

- Two-sided critical level has the form
$$c_\alpha = \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2_{\alpha/(2n),\, n-2}}{n - 2 + t^2_{\alpha/(2n),\, n-2}}}$$
where
$$P(T_k \ge t_{\gamma,k}) = \gamma$$
is the upper tail quantile of $T_k$.
- In R, you can use the functions pnorm, qnorm, pt, qt for these quantities.
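The two-sided critical level can be evaluated numerically. The slides point to R's `qt` for the t quantile; the sketch below is a Python stand-in that uses only the standard library, so it approximates the upper-tail quantile itself with Simpson's rule and bisection. All function names here are hypothetical, and the numerical scheme is an assumption chosen for self-containment, not the method the course uses.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(x, df, steps=2000):
    """P(T_df >= x) for x >= 0, via Simpson's rule on [0, x] (steps must be even)."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * t_pdf(j * h, df)
    return 0.5 - total * h / 3

def t_upper_quantile(gamma, df):
    """Return t such that P(T_df >= t) = gamma, for 0 < gamma < 0.5."""
    lo, hi = 0.0, 1.0
    while t_upper_tail(hi, df) > gamma:  # expand until the quantile is bracketed
        lo, hi = hi, 2 * hi
    for _ in range(60):                  # bisection on the bracket
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > gamma:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def grubbs_two_sided_critical(n, alpha):
    """Two-sided Grubbs critical level c_alpha for sample size n."""
    t = t_upper_quantile(alpha / (2 * n), n - 2)
    return (n - 1) / math.sqrt(n) * math.sqrt(t * t / (n - 2 + t * t))

print(round(grubbs_two_sided_critical(10, 0.05), 3))
```

For n = 10 and α = 0.05 this gives a critical level of about 2.29. Note that the statistic can never exceed (n - 1)/√n, which is the limit of the formula as the t quantile grows.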

Model based: linear regression with outliers

Figure: Residuals from model can be fed into Grubbs’ test or Bonferroni (variant). (Figure from Introduction to Data Mining.)
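As a rough illustration of feeding regression residuals into a Grubbs-type statistic, the sketch below fits a simple least-squares line to synthetic data with one planted outlier and computes the two-sided statistic on the residuals. The data, the function names, and the use of the sample standard deviation of the residuals are all assumptions made for the example; residuals are not exactly an IID normal sample, so the Grubbs threshold is only an approximation here.

```python
import statistics

def least_squares_residuals(xs, ys):
    """Fit y = a + b x by ordinary least squares and return the residuals."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def two_sided_grubbs(values):
    """Return (index, statistic) of the most extreme standardized value."""
    center = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample SD of the residuals
    scores = [abs(v - center) / sd for v in values]
    worst = max(range(len(values)), key=lambda i: scores[i])
    return worst, scores[worst]

xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]
ys[5] += 10.0  # synthetic outlier planted at x = 5

residuals = least_squares_residuals(xs, ys)
index, statistic = two_sided_grubbs(residuals)
print(index, round(statistic, 3))
```

The planted point has the largest standardized residual, and the statistic can then be compared with whatever threshold the chosen test prescribes.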

Multivariate data

- If the non-outlying data is assumed to be multivariate Gaussian, what is the analogue of Grubbs’ statistic
$$\frac{\max_{1 \le i \le n} |Z_i - \bar{Z}|}{SD(Z)}$$
- Answer: use Mahalanobis distance
$$\max_{1 \le i \le n} (Z_i - \bar{Z})^T \hat{\Sigma}^{-1} (Z_i - \bar{Z})$$
- Above, each individual statistic has what looks like a Hotelling’s $T^2$ distribution.
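A small sketch of the Mahalanobis statistic, restricted to two-dimensional data so that the covariance inverse has a closed form and no linear algebra library is needed. The function name, the sample data, and the choice of the n - 1 denominator for the covariance estimate are assumptions made for illustration.

```python
def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance entries (assumption: n - 1 denominator).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) Sigma^{-1} (dx, dy)^T using the closed-form 2x2 inverse.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

pts = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (8.0, -3.0)]
d2 = mahalanobis_sq_2d(pts)
most_extreme = max(range(len(pts)), key=lambda i: d2[i])
print(most_extreme, [round(v, 3) for v in d2])
```

One caveat worth keeping in mind: because the outlier itself inflates the estimated covariance, the squared distances always sum to (n - 1) times the dimension, so a single point can only stand out so far in a small sample.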

Likelihood approach

- Assume data is a mixture
$$F = (1 - \lambda) M + \lambda A.$$
- Above, $M$ is the distribution of “most of the data.”
- The distribution $A$ is an “outlier” distribution, could be uniform on a bounding box for the data.
- This is a mixture model. If $M$ is parametric, then the EM algorithm fits naturally here.
- Any points assigned to $A$ are “outliers.”

Likelihood approach

- Do we estimate $\lambda$ or fix it?
- The book starts describing an algorithm that tries to maximize the equivalent classification likelihood
$$L(\theta_M, \theta_A; l) = \left( (1 - \lambda)^{\# l_M} \prod_{i \in l_M} f_M(x_i; \theta_M) \right) \left( \lambda^{\# l_A} \prod_{i \in l_A} f_A(x_i; \theta_A) \right)$$

Likelihood approach: Algorithm

Algorithm tries to maximize this by forming iterative estimates $(M_t, A_t)$ of “normal” and “outlying” data points.

1. At each stage, tries to move individual points from $M_t$ to $A_t$.
2. Find $(\hat{\theta}_M, \hat{\theta}_A)$ based on the new partition (if necessary).
3. If the increase in likelihood is large enough, call the new sets $(M_{t+1}, A_{t+1})$.
4. Repeat until no further changes.
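The iteration above can be sketched in a few lines for one-dimensional data. This is a simplified illustration under several assumptions that are not in the slides: $M$ is Gaussian with maximum-likelihood mean and variance, $A$ is uniform on the observed data range, $\lambda$ is fixed rather than estimated, and a move is accepted when the log-likelihood rises by more than a hypothetical `min_gain`.

```python
import math

def log_likelihood(normal, outliers, lam, lo, hi):
    """Classification log-likelihood: Gaussian M on `normal`, uniform A on [lo, hi]."""
    n_m = len(normal)
    mu = sum(normal) / n_m
    var = max(sum((x - mu) ** 2 for x in normal) / n_m, 1e-12)  # floor avoids log(0)
    ll_m = n_m * math.log(1 - lam) + sum(
        -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        for x in normal)
    ll_a = len(outliers) * (math.log(lam) - math.log(hi - lo))
    return ll_m + ll_a

def detect_outliers(data, lam=0.05, min_gain=1.0):
    """Greedily move points from M to A while the log-likelihood gain is large enough."""
    lo, hi = min(data), max(data)
    normal, outliers = list(data), []
    current = log_likelihood(normal, outliers, lam, lo, hi)
    changed = True
    while changed:
        changed = False
        for x in list(normal):
            if len(normal) <= 2:          # keep enough points to fit M
                break
            trial_normal = list(normal)
            trial_normal.remove(x)
            trial = log_likelihood(trial_normal, outliers + [x], lam, lo, hi)
            if trial - current > min_gain:
                normal, outliers, current = trial_normal, outliers + [x], trial
                changed = True
    return normal, outliers

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25.0]
normal, outliers = detect_outliers(data)
print(outliers)
```

On this toy sample only the value 25.0 is moved to the outlier set: removing it shrinks the fitted variance of $M$ enough to outweigh the cost of assigning a point to the uniform component, while moving any of the remaining points does not.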

Nearest neighbour approach

- Many ways to define outliers.
- Example: data points for which there are fewer than $k$ neighbouring points within a distance $\epsilon$.
- Example: the $n$ points whose distance to the $k$-th nearest neighbour is largest.
- The $n$ points whose average distance to the first $k$ nearest neighbours is largest.
- Each of these methods depends on the choice of some parameters: $k$, $n$, $\epsilon$. It is difficult to choose these in a systematic way.
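Two of the nearest-neighbour scores listed above can be sketched with brute-force distance computations. The function names and the toy data are illustrative assumptions, and Euclidean distance is assumed since the slides do not fix a metric.

```python
import math

def sorted_neighbour_distances(points, i):
    """Distances from points[i] to every other point, in increasing order."""
    return sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)

def kth_nn_distance(points, k):
    """Score each point by the distance to its k-th nearest neighbour."""
    return [sorted_neighbour_distances(points, i)[k - 1] for i in range(len(points))]

def avg_knn_distance(points, k):
    """Score each point by its average distance to its k nearest neighbours."""
    return [sum(sorted_neighbour_distances(points, i)[:k]) / k
            for i in range(len(points))]

def top_n_outliers(scores, n):
    """Indices of the n points with the largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
kth_scores = kth_nn_distance(points, k=2)
avg_scores = avg_knn_distance(points, k=2)
print(top_n_outliers(kth_scores, n=1), top_n_outliers(avg_scores, n=1))
```

Both scores single out the isolated point here, but on less clear-cut data the ranking can change with $k$, which is the sensitivity the last bullet warns about.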

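Density approach

- For each point $x_i$, compute a density estimate $f_{x_i,k}$ using its $k$ nearest neighbours.
- Density estimate used is
$$f_{x_i,k} = \left( \frac{\sum_{y \in N(x_i,k)} d(x_i, y)}{\# N(x_i,k)} \right)^{-1}$$
- Define
$$\text{LOF}(x_i) = \frac{f_{x_i,k}}{\left( \sum_{y \in N(x_i,k)} f_{y,k} \right) / \# N(x_i,k)}$$
- As written, this ratio falls below 1 when $x_i$ sits in a sparser region than its neighbours; ranking outliers by largest score corresponds to taking the reciprocal of this ratio.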
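The density estimate and the ratio defined above can be sketched as follows. The function names are hypothetical, Euclidean distance is assumed, and the code computes the ratio exactly as written on the slide (point density over the average neighbour density), so isolated points receive small values.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def density(points, i, k):
    """Inverse of the average distance from points[i] to its k nearest neighbours."""
    nbrs = knn(points, i, k)
    return len(nbrs) / sum(math.dist(points[i], points[j]) for j in nbrs)

def density_ratio(points, i, k):
    """Density of points[i] divided by the average density of its neighbours."""
    nbrs = knn(points, i, k)
    avg_neighbour_density = sum(density(points, j, k) for j in nbrs) / len(nbrs)
    return density(points, i, k) / avg_neighbour_density

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
ratios = [density_ratio(points, i, k=2) for i in range(len(points))]
print([round(r, 3) for r in ratios])
```

The four clustered points get a ratio of 1, while the isolated point gets a ratio well below 1; taking `1 / ratio` turns this into a score where larger means more outlying. The sketch does not handle duplicate points, which would give a zero average distance.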

Compute the local outlier factor (LOF) of a point as the average of the ratios of the density of the point and the density of its nearest neighbours. Outliers are points with the largest LOF value.

Figure: Nearest neighbour vs. density based (points p1 and p2 marked; from Tan, Steinbach, Kumar, Introduction to Data Mining).

Detection rate

- Set $P(O)$ to be the proportion of outliers or anomalies.
- Set $P(D \mid O)$ to be the probability of declaring an outlier if it truly is an outlier. This is the detection rate.
- Set $P(D \mid O^c)$ to be the probability of declaring an outlier if it is truly not an outlier.

Bayesian detection rate

- Bayesian detection rate is
$$P(O \mid D) = \frac{P(D \mid O) P(O)}{P(D \mid O) P(O) + P(D \mid O^c) P(O^c)}.$$
- The false alarm rate or false discovery rate is
$$P(O^c \mid D) = \frac{P(D \mid O^c) P(O^c)}{P(D \mid O^c) P(O^c) + P(D \mid O) P(O)}.$$
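A short worked example of the two formulas above. The function names and the input probabilities are made up for illustration; they are not taken from the slides.

```python
def bayesian_detection_rate(p_outlier, p_detect_given_outlier, p_detect_given_normal):
    """P(O | D): probability that a declared outlier truly is an outlier."""
    true_alarms = p_detect_given_outlier * p_outlier
    false_alarms = p_detect_given_normal * (1 - p_outlier)
    return true_alarms / (true_alarms + false_alarms)

def false_alarm_rate(p_outlier, p_detect_given_outlier, p_detect_given_normal):
    """P(O^c | D): probability that a declared outlier is actually normal."""
    return 1 - bayesian_detection_rate(
        p_outlier, p_detect_given_outlier, p_detect_given_normal)

# Hypothetical inputs: 1% outliers, 90% detection rate, 5% of normal points flagged.
bdr = bayesian_detection_rate(0.01, 0.90, 0.05)
far = false_alarm_rate(0.01, 0.90, 0.05)
print(round(bdr, 4), round(far, 4))
```

With these inputs the Bayesian detection rate is about 0.154, so roughly 85% of declared outliers are false alarms even though the detector catches 90% of true outliers. The rarity of outliers is what drives this.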
