Text Categorization Using Naïve Bayes


Text Categorization using Naïve Bayes
Mausam
(based on slides of Dan Weld, Prabhakar Raghavan, Hinrich Schutze, Guillaume Obozinski, David D. Lewis)

Categorization
Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space.
– A fixed set of categories: C = {c1, c2, …, cn}
Determine:
– The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

Sample Category Learning Problem
Instance language: ⟨size, color, shape⟩
– size ∈ {small, medium, large}
– color ∈ {red, blue, green}
– shape ∈ {square, circle, triangle}
C = {positive, negative}
[table of labeled training examples; e.g., example 4: ⟨large, blue, circle⟩ → negative]

Another Example: County vs. Country?

Example: County vs. Country?
Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space.
– A fixed set of categories: C = {c1, c2, …, cn}
Determine:
– The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

Text Categorization
Assigning documents to a fixed set of categories, e.g.:
Web pages
– Yahoo-like classification
– What else?
Email messages
– Spam filtering
– Prioritizing
– Folderizing
News articles
– Personalized newspaper
Web ranking
– Is the page related to selling something?

Procedural Classification
Approach:
– Write a procedure to determine a document's class
– E.g., Spam?

Learning for Text Categorization
Hard to construct text categorization functions.
Learning algorithms:
– Bayesian (naïve)
– Neural network
– Relevance Feedback (Rocchio)
– Rule based (C4.5, Ripper, Slipper)
– Nearest Neighbor (case based)
– Support Vector Machines (SVM)

Learning for Categorization
A training example is an instance x ∈ X, paired with its correct category c(x): ⟨x, c(x)⟩, for an unknown categorization function c.
Given a set of training examples, D:
D = { ⟨…, county⟩, ⟨…, country⟩, … }
Find a hypothesized categorization function h(x) such that:

∀⟨x, c(x)⟩ ∈ D : h(x) = c(x)   (Consistency)

Function Approximation
There may not be any perfect fit.
Classification ≈ discrete functions
h(x) = nigeria(x) ∧ wire-transfer(x)
[figure: plot comparing the hypothesis h(x) to the true function c(x) over x]

General Learning Issues
Many hypotheses are consistent with the training data.
Bias
– Any criteria other than consistency with the training data that are used to select a hypothesis.
Classification accuracy
– % of instances classified correctly
– (Measured on independent test data.)
Training time
– Efficiency of the training algorithm
Testing time
– Efficiency of subsequent classification

Generalization
Hypotheses must generalize to correctly classify instances not in the training data.
Simply memorizing training examples is a consistent hypothesis that does not generalize.

Why is Learning Possible?
Experience alone never justifies any conclusion about any unseen instance.
Learning occurs when PREJUDICE meets DATA!

Bias
The nice word for prejudice is "bias".
What kind of hypotheses will you consider?
– What is the allowable range of functions you use when approximating?
What kind of hypotheses do you prefer?

Bayesian Methods
Learning and classification methods based on probability theory.
– Bayes theorem plays a critical role in probabilistic learning and classification.
– Uses the prior probability of each category given no information about an item.
Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Bayesian Categorization
Let the set of categories be {c1, c2, …, cn}.
Let E be a description of an instance.
Determine the category of E by computing, for each ci:

P(ci | E) = P(ci) P(E | ci) / P(E)

P(E) can be ignored since it is a constant factor common to all categories:

P(ci | E) ∝ P(ci) P(E | ci)

Bayesian Categorization
P(ci | E) ∝ P(ci) P(E | ci)
Need to know:
– Priors: P(ci)
– Conditionals: P(E | ci)
P(ci) are easily estimated from data:
– If ni of the examples in D are in ci, then P(ci) = ni / |D|
Assume an instance is a conjunction of binary features:
E = e1 ∧ e2 ∧ … ∧ em
Too many possible instances (exponential in m) to estimate all P(E | ci).

Naïve Bayesian Motivation
Problem: too many possible instances (exponential in m) to estimate all P(E | ci).
If we assume that the features of an instance are independent given the category ci (conditionally independent):

P(E | ci) = P(e1 ∧ e2 ∧ … ∧ em | ci) = ∏ (j = 1..m) P(ej | ci)

Therefore, we only need to know P(ej | ci) for each feature and category.

Naïve Bayes Example
C = {allergy, cold, well}
e1 = sneeze; e2 = cough; e3 = fever
E = {sneeze, cough, ¬fever}

Prob            Well    Cold    Allergy
P(ci)           0.9     0.05    0.05
P(sneeze | ci)  0.1     0.9     0.9
P(cough | ci)   0.1     0.8     0.7
P(fever | ci)   0.01    0.7     0.4

Naïve Bayes Example (cont.)

Prob            Well    Cold    Allergy
P(sneeze | ci)  0.1     0.9     0.9
P(cough | ci)   0.1     0.8     0.7
P(fever | ci)   0.01    0.7     0.4

E = {sneeze, cough, ¬fever}

P(well | E) = (0.9)(0.1)(0.1)(0.99)/P(E) = 0.0089/P(E)
P(cold | E) = (0.05)(0.9)(0.8)(0.3)/P(E) = 0.01/P(E)
P(allergy | E) = (0.05)(0.9)(0.7)(0.6)/P(E) = 0.019/P(E)

Most probable category: allergy
P(E) = 0.0089 + 0.01 + 0.019 = 0.0379
P(well | E) ≈ 0.23
P(cold | E) ≈ 0.26
P(allergy | E) ≈ 0.50
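The posterior computation above is mechanical enough to check in a few lines of code. The sketch below simply re-runs the slide's arithmetic; the dictionaries and variable names are my own, and the slide's rounded intermediate values explain the small differences in the cold/allergy posteriors.

```python
# Reproduces the allergy/cold/well example; numbers come from the table above.
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
cond = {  # P(feature present | class)
    "well":    {"sneeze": 0.1, "cough": 0.1, "fever": 0.01},
    "cold":    {"sneeze": 0.9, "cough": 0.8, "fever": 0.7},
    "allergy": {"sneeze": 0.9, "cough": 0.7, "fever": 0.4},
}
evidence = {"sneeze": True, "cough": True, "fever": False}   # E = {sneeze, cough, ¬fever}

scores = {}
for c in priors:                          # unnormalized P(ci) * prod_j P(ej | ci)
    score = priors[c]
    for feat, present in evidence.items():
        p = cond[c][feat]
        score *= p if present else (1 - p)
    scores[c] = score

p_e = sum(scores.values())                # P(E), summing over all categories
posteriors = {c: s / p_e for c, s in scores.items()}
print(posteriors)   # ≈ {'well': 0.23, 'cold': 0.28, 'allergy': 0.49}
```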

Learning Probabilities
Normally, probabilities are estimated based on observed frequencies in the training data.
If D contains ni examples in category ci, and nij of these ni examples contain feature ej, then:

P(ej | ci) = nij / ni

However, estimating such probabilities from small training sets is error-prone.
If, due only to chance, a rare feature ek is always false in the training data, then ∀ci: P(ek | ci) = 0.
If ek then occurs in a test example E, the result is that ∀ci: P(E | ci) = 0 and ∀ci: P(ci | E) = 0.

Smoothing
To account for estimation from small samples, probability estimates are adjusted or smoothed.
Laplace smoothing using an m-estimate assumes that each feature is given a prior probability, p, that is assumed to have been previously observed in a "virtual" sample of size m:

P(ej | ci) = (nij + m·p) / (ni + m)

For binary features, p is simply assumed to be 0.5; with m = 2 this reduces to (nij + 1) / (ni + 2).
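As a quick sanity check on the m-estimate, a feature that never co-occurs with a class no longer gets probability zero. A minimal sketch (the function name and default arguments are mine):

```python
# m-estimate smoothing: P(e_j | c_i) = (n_ij + m*p) / (n_i + m)
def m_estimate(n_ij, n_i, m=2, p=0.5):
    return (n_ij + m * p) / (n_i + m)

print(m_estimate(0, 10))   # 0.0833... rather than 0 for an unseen feature
print(m_estimate(7, 10))   # 0.6666...
```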

Naïve Bayes for Text
Modeled as generating a bag of words for a document in a given category by repeatedly sampling with replacement from a vocabulary V = {w1, w2, …, wm} based on the probabilities P(wj | ci).
Smooth probability estimates with Laplace m-estimates assuming a uniform distribution over all words (p = 1/|V|) and m = |V|.
– Equivalent to a virtual sample of seeing each word in each category exactly once.

Text Naïve Bayes Algorithm (Train)
Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
    Let Di be the subset of documents in D in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all the documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj ∈ V
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)
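A direct Python translation of the training pseudocode might look like the sketch below; the function name and data layout (documents as token lists) are my own choices, not from the slides.

```python
from collections import Counter

def train_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of category names."""
    vocab = {w for doc in docs for w in doc}                    # V
    priors, cond_probs = {}, {}
    for c in set(labels):
        docs_c = [d for d, y in zip(docs, labels) if y == c]    # D_i
        priors[c] = len(docs_c) / len(docs)                     # P(c_i)
        counts = Counter(w for d in docs_c for w in d)          # n_ij for every w_j
        n_c = sum(counts.values())                              # n_i
        # Laplace smoothing: P(w_j | c_i) = (n_ij + 1) / (n_i + |V|)
        cond_probs[c] = {w: (counts[w] + 1) / (n_c + len(vocab)) for w in vocab}
    return priors, cond_probs, vocab
```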

Text Naïve Bayes Algorithm (Test)
Given a test document X
Let n be the number of word occurrences in X
Return the category:

argmax (ci ∈ C)  P(ci) ∏ (i = 1..n) P(ai | ci)

where ai is the word occurring in the ith position in X
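A matching classification sketch, using the model returned by the train_nb sketch above (unseen test words are simply skipped here, a common simplification):

```python
def classify_nb(doc, priors, cond_probs, vocab):
    best_cat, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior                       # P(c_i)
        for w in doc:                       # times prod_i P(a_i | c_i)
            if w in vocab:                  # ignore words never seen in training
                score *= cond_probs[c][w]
        if score > best_score:
            best_cat, best_score = c, score
    return best_cat

# Toy usage with the training sketch above:
docs = [["wire", "transfer", "nigeria"], ["meeting", "monday"], ["free", "transfer"]]
labels = ["spam", "ham", "spam"]
priors, cond_probs, vocab = train_nb(docs, labels)
print(classify_nb(["nigeria", "wire"], priors, cond_probs, vocab))   # spam
```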

Naïve Bayes Time Complexity
Training time: O(|D| Ld + |C| |V|), where Ld is the average length of a document in D.
– Assumes V and all Di, ni, and nij are pre-computed in O(|D| Ld) time during one pass through all of the data.
– Generally just O(|D| Ld) since usually |C| |V| < |D| Ld
Test time: O(|C| Lt), where Lt is the average length of a test document.
Very efficient overall; linearly proportional to the time needed to just read in all the data.

Easy to Implement
But if you do, it probably won't work…

Probabilities: Important Detail!
P(spam | E1 … En) ∝ ∏i P(spam | Ei)
Any more potential problems here?
We are multiplying lots of small numbers: danger of underflow!
0.5^57 ≈ 7 × 10^-18
Solution? Use logs and add!
p1 · p2 = e^(log(p1) + log(p2))
Always keep in log form
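The log trick is a one-liner in practice; a tiny illustration (not from the slides):

```python
import math

probs = [0.5] * 57
direct = math.prod(probs)                       # ~7e-18, heading toward underflow
log_score = sum(math.log(p) for p in probs)     # safe: just a sum of negative numbers

print(direct)                # 6.938893903907228e-18
print(math.exp(log_score))   # same value, recovered from log space
```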

Multi-Class Categorization
Pick the category with max probability
One-vs-all (OVA): create many 1-vs-other classifiers
– Classes = {City, County, Country}
– Classifier 1 = {City} vs. {County, Country}
– Classifier 2 = {County} vs. {City, Country}
– Classifier 3 = {Country} vs. {City, County}
All-vs-all (AVA): for each pair of classes, build a classifier
– {City vs. County}, {City vs. Country}, {County vs. Country}
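For reference, libraries wrap both strategies; the sketch below uses scikit-learn, which is my choice here and not mentioned in the slides (the course assignment uses Weka instead):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.naive_bayes import MultinomialNB

docs = ["king county budget", "the country of spain", "city council meeting"]
labels = ["County", "Country", "City"]
X = CountVectorizer().fit_transform(docs)                   # bag-of-words counts

ova = OneVsRestClassifier(MultinomialNB()).fit(X, labels)   # one classifier per class
ava = OneVsOneClassifier(MultinomialNB()).fit(X, labels)    # one classifier per class pair
```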

Multi-Class Categorization
Pick the category with max probability
Create many OVA/AVA classifiers
Use a hierarchical approach (wherever a hierarchy is available), e.g.:
Entity
– Person: Scientist, Artist
– Location: City, County, Country

Advantages
Simple to implement
– No numerical optimization, matrix algebra, etc.
Efficient to train and use
– Easy to update with new data
– Fast to apply
Binary/multi-class
Independence allows parameters to be estimated on different datasets
Comparatively good effectiveness with small training sets

Disadvantages
Independence assumption wrong
– Absurd estimates of class probabilities (output probabilities close to 0 or 1)
– Thresholds must be tuned; not set analytically
Generative model
– Generally lower effectiveness than discriminative techniques
– Improving parameter estimates can hurt classification effectiveness

Experimental Evaluation
Question: how do we estimate the performance of a classifier on unseen data?
– Can't just look at accuracy on training data: this will yield an over-optimistic estimate of performance
Solution: cross-validation
Note: this is sometimes called estimating how well the classifier will generalize

Evaluation: Cross Validation
Partition examples into k disjoint sets
Now create k training sets
– Each set is the union of all the partitions except one
– So each training set has (k-1)/k of the original training data
[figure: each of the k folds is held out as the Test set in turn, with the remaining folds used for Training]

Cross-Validation (2)
Leave-one-out
– Use if < 100 examples (rough estimate)
– Hold out one example, train on remaining examples
10-fold
– If you have 100 to 1000s of examples
M of N fold
– Repeat M times
– Divide data into N folds, do N-fold cross-validation
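A hand-rolled k-fold cross-validation around the earlier train_nb/classify_nb sketches; the fold assignment (round-robin by index) and accuracy bookkeeping are my own:

```python
def cross_validate(docs, labels, k=10):
    """Assumes at least k labeled documents; returns mean held-out accuracy."""
    accuracies = []
    for fold in range(k):
        train_docs   = [d for i, d in enumerate(docs) if i % k != fold]
        train_labels = [y for i, y in enumerate(labels) if i % k != fold]
        test = [(d, y) for i, (d, y) in enumerate(zip(docs, labels)) if i % k == fold]

        priors, cond_probs, vocab = train_nb(train_docs, train_labels)
        correct = sum(classify_nb(d, priors, cond_probs, vocab) == y for d, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k      # average accuracy over the k folds
```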

Evaluation Metrics
Accuracy: no. of questions correctly answered
Precision (for one label): accuracy when classification = label
Recall (for one label): measures how many instances of a label were missed
F-measure (for one label): harmonic mean of precision & recall
Area under precision-recall curve (for one label): vary a parameter to show different points on the p-r curve; take the area

Precision & Recall
Two-class situation (confusion matrix, Predicted vs. Actual):

                Actual P    Actual N
Predicted P     TP          FP
Predicted N     FN          TN

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2pr / (p + r)

[figure: multi-class confusion matrix]
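Computing per-label precision, recall, and F-measure from predictions is straightforward; a small sketch following the formulas above (the function name and the spam/ham toy data are mine):

```python
def prf(y_true, y_pred, label):
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(prf(["spam", "ham", "spam", "ham"],
          ["spam", "spam", "spam", "ham"], "spam"))   # (0.666..., 1.0, 0.8)
```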

A typical precision-recall curve
[figure: precision-recall curve]


Assignment 4
20 newsgroups data
Use Weka to learn a multi-class classifier
Evaluation = accuracy
– Output a label for each document

Construct Better Features
Key to machine learning is having good features
In industrial data mining, large effort is devoted to constructing appropriate features
Ideas?

Possible Feature Ideas
Look at capitalization (may indicate a proper noun)
Look for commonly occurring sequences
– E.g., New York, New York City
– Limit to 2-3 consecutive words
– Keep all that meet a minimum threshold (e.g., occur at least 5 or 10 times in the corpus); see the sketch below
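A quick sketch of the frequent-sequence idea: collect 2- and 3-word n-grams and keep those above a count threshold (the thresholds and names here are illustrative):

```python
from collections import Counter

def frequent_ngrams(docs, n_values=(2, 3), min_count=5):
    counts = Counter()
    for doc in docs:                              # each doc is a list of tokens
        for n in n_values:
            for i in range(len(doc) - n + 1):
                counts[tuple(doc[i:i + n])] += 1  # count each n-gram occurrence
    return {ngram for ngram, c in counts.items() if c >= min_count}
```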

Properties of Text
Word frequencies have a skewed distribution
– 'The' and 'of' account for 10% of all words
– The six most common words account for 40%
Zipf's Law: rank × probability ≈ c (e.g., c ≈ 0.1)
From [Croft, Metzler & Strohman 2010]
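Zipf's law is easy to eyeball on any corpus; the helper below (entirely my own) tabulates rank × probability for the most frequent words, which should hover around a roughly constant c:

```python
from collections import Counter

def zipf_table(tokens, top=10):
    total = len(tokens)
    rows = []
    for rank, (word, count) in enumerate(Counter(tokens).most_common(top), start=1):
        prob = count / total
        rows.append((rank, word, prob, rank * prob))   # last column ≈ constant c
    return rows
```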

Associated Press Corpus "AP89"
[table: AP89 collection statistics]
From [Croft, Metzler & Strohman 2010]

Middle Ground
Very common words → bad features
– Language-based stop list: words that bear little meaning (20-500 words), e.g.
  http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
– Subject-dependent stop lists
Very rare words → also bad features
– Drop words appearing less than k times in the corpus
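Both filters fit in a few lines; the tiny stop list below is only an illustration (a real one would come from a list like the one linked above):

```python
from collections import Counter

STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in", "is"}   # illustrative only

def filter_vocab(docs, k=5):
    """Drop stop words and words occurring fewer than k times in the corpus."""
    counts = Counter(w for doc in docs for w in doc)
    return [[w for w in doc if w not in STOP_WORDS and counts[w] >= k]
            for doc in docs]
```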

Stemming
Are these different index terms?
– retrieve, retrieving, retrieval, retrieved, retrieves
Stemming algorithm:
– (retrieve, retrieving, retrieval, retrieved, retrieves) → retriev
– Strips prefixes or suffixes (-s, -ed, -ly, -ness)
– Morphological stemming

Stemming Continued
Can reduce vocabulary by ~1/3
C, Java, Perl, Python, C# versions:
– www.tartarus.org/~martin/PorterStemmer
Criterion for removing a suffix:
– Does "a document is about w1" mean the same as "a document about w2"?
Problems: sand / sander & wand / wander
Commercial search engines use giant in-memory tables
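One widely available Porter implementation is in NLTK (my choice here; the slide points to the reference implementations at tartarus.org instead):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["retrieve", "retrieving", "retrieval", "retrieved", "retrieves"]
print([stemmer.stem(w) for w in words])
# expect all five to collapse to essentially the same stem (e.g. 'retriev')
```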

Word Frequency
Which word is more indicative of document similarity: 'book' or 'Rumplestiltskin'?
– Need to consider "document frequency": how frequently the word appears in the doc collection.
Which doc is a better match for the query "Kangaroo": one with a single mention of kangaroos, or a doc that mentions it 10 times?
– Need to consider "term frequency": how many times the word appears in the current document.

TF x IDF

w_ik = tf_ik × log(N / n_k)

where:
– T_k = term k in document D_i
– tf_ik = frequency of term T_k in document D_i
– idf_k = inverse document frequency of term T_k in collection C = log(N / n_k)
– N = total number of documents in the collection C
– n_k = the number of documents in C that contain T_k

Inverse Document Frequency
IDF provides high values for rare words and low values for common words.
For a collection of 10000 documents (base-10 logs):
– log(10000 / 10000) = 0
– log(10000 / 5000) = 0.301
– log(10000 / 20) ≈ 2.699
– log(10000 / 1) = 4

TF-IDF normalization
Normalize the term weights
– so longer docs are not given more weight (fairness)
– force all values to fall within a certain range: [0, 1]

w_ik = tf_ik · log(N / n_k) / sqrt( Σ_{k=1..t} (tf_ik)² · [log(N / n_k)]² )
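Putting the weighting and normalization together, a compact sketch (raw counts for tf, base-10 idf, cosine-style length normalization; the data layout is my own choice):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists -> list of {term: normalized tf-idf weight}."""
    N = len(docs)
    df = Counter()                         # n_k: number of docs containing term k
    for doc in docs:
        df.update(set(doc))

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: tf[t] * math.log10(N / df[t]) for t in tf}       # tf_ik * log(N/n_k)
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0  # length normalization
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors
```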

Assignment 4
20 newsgroups data
Use Weka to learn a multi-class classifier
Use any of these ideas or more:
– Stop words / rare words
– Stemming
– Tf-idf
– POS-enhanced features
– Different classifiers
– Parameter search
– etc.

