Statistics 202: Data Mining
Outliers
© Jonathan Taylor
Based in part on slides from textbook, slides of Susan Holmes
December 2, 2012

Concepts
What is an outlier? The set of data points that are considerably different than the remainder of the data . . .
When do they appear in data mining tasks?
- Given a data matrix $X$, find all the cases $x_i \in X$ with anomaly/outlier scores greater than some threshold $t$. Or, the top $n$ outlier scores.
- Given a data matrix $X$, containing mostly normal (but unlabeled) data points, and a test case $x_{new}$, compute an anomaly/outlier score of $x_{new}$ with respect to $X$.
Applications
- Credit card fraud detection;
- Network intrusion detection;
- Misspecification of a model.

What is an outlier?
[Figure]

Issues
- How many outliers are there in the data?
- Method is unsupervised, similar to clustering or finding clusters with only 1 point in them.
- Usual assumption: There are considerably more "normal" observations than "abnormal" observations (outliers/anomalies) in the data.

General steps
- Build a profile of the "normal" behavior. The profile generally consists of summary statistics of this "normal" population.
- Use these summary statistics to detect anomalies, i.e. points whose characteristics are very far from the normal profile.
- General types of schemes involve a statistical model of "normal", and "far" is measured in terms of likelihood.
- Other schemes based on distances can be quasi-motivated by such statistical techniques . . .

Statistical approach
- Assume a parametric model describing the distribution of the data (e.g., normal distribution).
- Apply a statistical test that depends on:
  - Data distribution (e.g. normal)
  - Parameter of distribution (e.g., mean, variance)
  - Number of expected outliers (confidence limit, $\alpha$ or Type I error)

Grubbs' Test
- Suppose we have a sample of $n$ numbers $Z = \{Z_1, \dots, Z_n\}$, i.e. an $n \times 1$ data matrix.
- Assuming the data is from a normal distribution, Grubbs' test uses the distribution of
$$\max_{1 \le i \le n} \frac{Z_i - \bar{Z}}{SD(Z)}$$
to search for outlying large values.

Grubbs' Test
- Lower tail variant:
$$\min_{1 \le i \le n} \frac{Z_i - \bar{Z}}{SD(Z)}$$
- Two-sided variant:
$$\max_{1 \le i \le n} \frac{|Z_i - \bar{Z}|}{SD(Z)}$$
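The three variants of Grubbs' statistic (upper tail, lower tail, two-sided) can be computed directly from a sample. The following is a minimal sketch, not code from the course: the function name `grubbs_statistics` and the example data are illustrative, and it assumes $SD(Z)$ means the sample standard deviation with an $n - 1$ denominator.

```python
import statistics

def grubbs_statistics(z):
    """Return (upper, lower, two_sided) Grubbs statistics for a sample z."""
    z_bar = statistics.mean(z)
    # Assumption: SD(Z) is the sample standard deviation (n - 1 denominator).
    sd = statistics.stdev(z)
    standardized = [(zi - z_bar) / sd for zi in z]
    upper = max(standardized)                      # searches for large values
    lower = min(standardized)                      # searches for small values
    two_sided = max(abs(s) for s in standardized)  # searches in both directions
    return upper, lower, two_sided

# Illustrative data: four ordinary values and one suspiciously large one.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
upper, lower, two_sided = grubbs_statistics(sample)
print(upper, lower, two_sided)
```

Each statistic still has to be compared against a critical value before a point is declared an outlier.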

Grubbs' Test
- Having chosen a test statistic, we must determine a threshold that sets our decision rule.
- Often this is set via a hypothesis test to control Type I error.
- For a large positive outlier, the threshold is based on choosing some acceptable Type I error $\alpha$ and finding $c_\alpha$ so that
$$P_0\left(\max_{1 \le i \le n} \frac{Z_i - \bar{Z}}{SD(Z)} \ge c_\alpha\right) \le \alpha$$
- Above, $P_0$ denotes the distribution of $Z$ under the assumption there are no outliers.
- If $Z$ are IID $N(\mu, \sigma^2)$ it is generally possible to compute a decent approximation of this probability using Bonferroni.
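The Bonferroni step can be written out as a short derivation. This is a sketch for the upper-tail statistic under $P_0$; the shorthand $G_i$ is introduced here for convenience and is not notation from the slides.

$$
G_i = \frac{Z_i - \bar{Z}}{SD(Z)}, \qquad
P_0\left(\max_{1 \le i \le n} G_i \ge c\right)
= P_0\left(\bigcup_{i=1}^{n} \{G_i \ge c\}\right)
\le \sum_{i=1}^{n} P_0(G_i \ge c)
= n \, P_0(G_1 \ge c).
$$

The last equality holds because, for IID data, every $G_i$ has the same distribution. Choosing $c$ so that $P_0(G_1 \ge c) = \alpha / n$ therefore keeps the Type I error at or below $\alpha$. For the two-sided statistic, the event $\{|G_i| \ge c\}$ has twice the one-sided tail probability by symmetry, so each tail is allotted $\alpha / (2n)$.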

Grubbs' Test
- Two-sided critical level has the form
$$c_\alpha = \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2_{\alpha/(2n),\,n-2}}{n - 2 + t^2_{\alpha/(2n),\,n-2}}}$$
where
$$P(T_k \ge t_{\gamma,k}) = \gamma$$
is the upper tail quantile of $T_k$.
- In R, you can use the functions pnorm, qnorm, pt, qt for these quantities.

Model based techniques
Model based: linear regression with outliers
- Build a model.
- Points which don't fit the model are identified as outliers.
- For the example at the right, least squares regression is appropriate.
- Residuals can be fed in to the test.
Figure: Residuals from model can be fed into Grubbs' test or Bonferroni (variant). (Source: Introduction to Data Mining.)

Multivariate data
- If the non-outlying data is assumed to be multivariate Gaussian, what is the analogue of Grubbs' statistic
$$\max_{1 \le i \le n} \frac{|Z_i - \bar{Z}|}{SD(Z)}$$
- Answer: use Mahalanobis distance
$$\max_{1 \le i \le n} (Z_i - \bar{Z})^T \hat{\Sigma}^{-1} (Z_i - \bar{Z})$$
- Above, each individual statistic has what looks like a Hotelling's $T^2$ distribution.
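To make the Mahalanobis score concrete, here is a small sketch restricted to two-dimensional data so that the covariance matrix can be inverted in closed form. The function name `mahalanobis_scores_2d` and the example points are illustrative, and the covariance estimate is assumed to be the usual sample covariance with an $n - 1$ denominator.

```python
def mahalanobis_scores_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance entries (assumption: n - 1 denominator).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    scores = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form (dx, dy) Sigma^{-1} (dx, dy)^T via the 2x2 inverse.
        scores.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return scores

# Illustrative data: a unit square plus one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]
scores = mahalanobis_scores_2d(points)
most_outlying = scores.index(max(scores))
print(most_outlying, scores)
```

The maximum of these scores plays the role of the two-sided Grubbs statistic. In higher dimensions a linear-algebra library would replace the hand-written inverse.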

Likelihood approach
- Assume data is a mixture
$$F = (1 - \lambda) M + \lambda A.$$
- Above, $M$ is the distribution of "most of the data."
- The distribution $A$ is an "outlier" distribution, could be uniform on a bounding box for the data.
- This is a mixture model. If $M$ is parametric, then the EM algorithm fits naturally here.
- Any points assigned to $A$ are "outliers."

Likelihood approach
- Do we estimate $\lambda$ or fix it?
- The book starts describing an algorithm that tries to maximize the equivalent classification likelihood
$$L(\theta_M, \theta_A; l) = \left((1 - \lambda)^{\# l_M} \prod_{i \in l_M} f_M(x_i; \theta_M)\right) \times \left(\lambda^{\# l_A} \prod_{i \in l_A} f_A(x_i; \theta_A)\right)$$

Likelihood approach: Algorithm
Algorithm tries to maximize this by forming iterative estimates $(M_t, A_t)$ of "normal" and "outlying" data points.
1. At each stage, tries to move individual points of $M_t$ to $A_t$.
2. Find $(\hat{\theta}_M, \hat{\theta}_A)$ based on the new partition (if necessary).
3. If the increase in likelihood is large enough, call the new sets $(M_{t+1}, A_{t+1})$.
4. Repeat until no further changes.
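The iteration can be sketched in a few lines for one-dimensional data. Everything specific here is an assumption made for illustration rather than the book's or the course's implementation: $M$ is taken to be Gaussian with maximum likelihood parameters, $A$ is uniform on the range of the data, $\lambda$ is fixed at 0.1, and the names `classification_loglik`, `fit_outliers`, and `min_gain` are hypothetical.

```python
import math

def classification_loglik(data, in_A, lam, lo, hi):
    """Log classification likelihood for a given partition into M and A."""
    M = [x for x, a in zip(data, in_A) if not a]
    n_A = len(data) - len(M)
    mu = sum(M) / len(M)
    var = sum((x - mu) ** 2 for x in M) / len(M)  # MLE of the variance
    if var <= 0.0:
        return -math.inf
    ll_M = len(M) * math.log(1.0 - lam) + sum(
        -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)
        for x in M
    )
    # Assumption: A is uniform on [lo, hi], so its density is 1 / (hi - lo).
    ll_A = n_A * (math.log(lam) - math.log(hi - lo))
    return ll_M + ll_A

def fit_outliers(data, lam=0.1, min_gain=0.0):
    """Greedily move points from M to A while the likelihood increases."""
    lo, hi = min(data), max(data)  # assumes the data are not all equal
    in_A = [False] * len(data)
    current = classification_loglik(data, in_A, lam, lo, hi)
    changed = True
    while changed:
        changed = False
        for i in range(len(data)):
            # Keep at least two points in M so its variance can be estimated.
            if in_A[i] or in_A.count(False) <= 2:
                continue
            in_A[i] = True  # tentatively move point i to A and refit M
            candidate = classification_loglik(data, in_A, lam, lo, hi)
            if candidate - current > min_gain:
                current, changed = candidate, True
            else:
                in_A[i] = False  # undo the move
    return [i for i, a in enumerate(in_A) if a]

# Illustrative data: ten values near zero and one far away.
data = [0.1, -0.2, 0.05, 0.3, -0.1, 0.2, -0.3, 0.0, 0.15, -0.15, 8.0]
outlier_indices = fit_outliers(data)
print(outlier_indices)
```

The threshold `min_gain` stands in for "large enough"; raising it makes the procedure more conservative about declaring outliers.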

Nearest neighbour approach
- Many ways to define outliers.
- Example: data points for which there are fewer than $k$ neighbouring points within a distance $\epsilon$.
- Example: the $n$ points whose distance to the $k$-th nearest neighbour is largest.
- Example: the $n$ points whose average distance to the first $k$ nearest neighbours is largest.
- Each of these methods depends on the choice of some parameters: $k$, $n$, $\epsilon$. It is difficult to choose these in a systematic way.
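One of these definitions, the $n$ points whose distance to the $k$-th nearest neighbour is largest, can be sketched with a brute-force search. The function names and the example points are illustrative, and Euclidean distance is an assumption; any other metric could be substituted.

```python
import math

def kth_nn_distance(points, k):
    """Distance from each point to its k-th nearest neighbour (Euclidean)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def top_n_outliers(points, k, n):
    """Indices of the n points with the largest k-th neighbour distance."""
    scores = kth_nn_distance(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

# Illustrative data: a unit square plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
flagged = top_n_outliers(points, k=2, n=1)
print(flagged)
```

The brute-force search costs on the order of the square of the number of points, which is fine for a sketch but not for large data sets.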

Density approach
- For each point $x_i$, compute a density estimate $f_{x_i,k}$ using its $k$ nearest neighbours.
- Density estimate used is
$$f_{x_i,k} = \left(\frac{\sum_{y \in N(x_i,k)} d(x_i, y)}{\# N(x_i,k)}\right)^{-1}$$
- Define
$$LOF(x_i) = \frac{f_{x_i,k}}{\left(\sum_{y \in N(x_i,k)} f_{y,k}\right) / \# N(x_i,k)}$$
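A brute-force sketch of this density ratio follows. The function names and example points are illustrative, Euclidean distance is assumed, and the data are assumed to contain no duplicate points (otherwise an average distance could be zero). Note that when the ratio is formed as a point's own density over the average density of its neighbours, isolated points receive small values; the more common LOF convention inverts the ratio so that outliers receive large values.

```python
import math

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of point i (Euclidean distance)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def knn_density(points, i, k):
    """Inverse of the average distance from point i to its k neighbours."""
    nbrs = knn_indices(points, i, k)
    avg_dist = sum(math.dist(points[i], points[j]) for j in nbrs) / len(nbrs)
    return 1.0 / avg_dist  # assumes no duplicate points

def density_ratio(points, k):
    """Own density divided by the average density of the k neighbours."""
    dens = [knn_density(points, i, k) for i in range(len(points))]
    ratios = []
    for i in range(len(points)):
        nbrs = knn_indices(points, i, k)
        ratios.append(dens[i] / (sum(dens[j] for j in nbrs) / len(nbrs)))
    return ratios

# Illustrative data: a unit square plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
ratios = density_ratio(points, k=2)
least_dense = ratios.index(min(ratios))
print(least_dense, ratios)
```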

Density approach
- Compute the local outlier factor (LOF) of a point as the average of the ratios of the density of the point and the density of its nearest neighbors.
- Outliers are points with largest LOF value.
Figure: Nearest neighbour vs. density based, showing two points p1 and p2. (Source: Tan, Steinbach, Kumar, Introduction to Data Mining.)

Detection rate
- Set $P(O)$ to be the proportion of outliers or anomalies.
- Set $P(D \mid O)$ to be the probability of declaring an outlier if it truly is an outlier. This is the detection rate.
- Set $P(D \mid O^c)$ to be the probability of declaring an outlier if it is truly not an outlier.

Bayesian detection rate
- Bayesian detection rate is
$$P(O \mid D) = \frac{P(D \mid O) P(O)}{P(D \mid O) P(O) + P(D \mid O^c) P(O^c)}.$$
- The false alarm rate or false discovery rate is
$$P(O^c \mid D) = \frac{P(D \mid O^c) P(O^c)}{P(D \mid O^c) P(O^c) + P(D \mid O) P(O)}.$$
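A short numerical sketch shows how strongly the base rate $P(O)$ affects these quantities. The function names and the input probabilities are made up for illustration and are not figures from the course.

```python
def bayesian_detection_rate(p_o, p_d_given_o, p_d_given_not_o):
    """P(O | D): probability a declared outlier really is an outlier."""
    true_alarms = p_d_given_o * p_o
    false_alarms = p_d_given_not_o * (1.0 - p_o)
    return true_alarms / (true_alarms + false_alarms)

def false_alarm_rate(p_o, p_d_given_o, p_d_given_not_o):
    """P(O^c | D): probability a declared outlier is actually normal."""
    true_alarms = p_d_given_o * p_o
    false_alarms = p_d_given_not_o * (1.0 - p_o)
    return false_alarms / (true_alarms + false_alarms)

# Illustrative inputs: 1% outliers, 99% detection rate, 1% false positives.
bdr = bayesian_detection_rate(0.01, 0.99, 0.01)
far = false_alarm_rate(0.01, 0.99, 0.01)
print(bdr, far)
```

With these inputs, half of all declared outliers are false alarms even though the detector is right 99% of the time on each class, because normal points outnumber outliers by 99 to 1.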
