Classification Example: Spam Filter - University Of California, Berkeley


CS 188: Artificial Intelligence - Naïve Bayes
Instructors: Dan Klein and Pieter Abbeel, University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Machine Learning
- Up until now: how to use a model to make optimal decisions
- Machine learning: how to acquire a model from data / experience
  - Learning parameters (e.g. probabilities)
  - Learning structure (e.g. BN graphs)
  - Learning hidden concepts (e.g. clustering, neural nets)
- Today: model-based classification with Naive Bayes

Classification
- Classification: given inputs x, predict labels (classes) y

Example: Spam Filter
- Input: an email
- Output: spam / ham
- Setup:
  - Get a large collection of example emails, each labeled "spam" or "ham"
  - Note: someone has to hand label all this data!
  - Want to learn to predict labels of new, future emails
- Features: the attributes used to make the ham / spam decision
  - Words: FREE!
  - Text patterns: $dd, CAPS
  - Non-text: SenderInContacts, WidelyBroadcast
- Example emails shown on the slide:
  - "Dear Sir. First, I must solicit your confidence in this transaction, this is by virtue of its nature as being utterly confidential and top secret. ..."
  - "TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99"
  - "Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened."

Example: Digit Recognition
- Input: images / pixel grids
- Output: a digit 0-9
- Setup:
  - Get a large collection of example images, each labeled with a digit
  - Note: someone has to hand label all this data!
  - Want to learn to predict labels of new, future digit images
- Examples: [images of handwritten digits such as 0, 1, 2]
- Features: the attributes used to make the digit decision
  - Pixels: (6,8) ON
  - Shape patterns: NumComponents, AspectRatio, NumLoops
  - Features are increasingly induced rather than crafted

Other Classification Tasks
- Medical diagnosis (input: symptoms, classes: diseases)
- Fraud detection (input: account activity, classes: fraud / no fraud)
- Automatic essay grading (input: document, classes: grades)
- Customer service email routing
- Review sentiment
- Language ID
- ... many more
- Classification is an important commercial technology!
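As a concrete (and entirely illustrative) companion to the feature lists above, the sketch below shows one way such ham/spam features could be computed from a raw email in Python. The feature names, the regular expressions, and the contacts argument are assumptions for illustration, not part of the slides.

```python
# A minimal sketch (not from the slides) of turning a raw email into the kind of
# binary features listed above: word presence, text patterns, non-text signals.
import re

def extract_features(email_text, sender, contacts):
    """Map one email to a dict of binary features."""
    words = re.findall(r"[A-Za-z']+", email_text.lower())
    features = {}
    for w in set(words):                                   # word-presence features, e.g. "free"
        features["word:" + w] = True
    features["pattern:all_caps"] = email_text.isupper()    # text-pattern feature
    features["pattern:dollar_amount"] = bool(re.search(r"\$\d", email_text))
    features["sender_in_contacts"] = sender in contacts    # non-text feature
    return features

print(extract_features("WIN A FREE PRIZE NOW", "spammer@example.com", {"alice@example.com"}))
```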

Model-Based Classification
- Model-based approach:
  - Build a model (e.g. a Bayes' net) where both the output label and the input features are random variables
  - Instantiate any observed features
  - Query for the distribution of the label conditioned on the features
- Challenges:
  - What structure should the BN have?
  - How should we learn its parameters?

Naïve Bayes for Digits
- Naïve Bayes: assume all features are independent effects of the label
- Simple digit recognition version:
  - One feature (variable) F_ij for each grid position <i,j>
  - Feature values are on / off, based on whether the intensity is more or less than 0.5 in the underlying image
  - Each input maps to a feature vector of on/off pixel values
  - Here: lots of features, each is binary valued
- Naïve Bayes model: P(Y | F_0,0, ..., F_15,15) ∝ P(Y) ∏_ij P(F_ij | Y)
- What do we need to learn?

General Naïve Bayes
- A general Naive Bayes model with label Y and features F_1, ..., F_n:
      P(Y, F_1, ..., F_n) = P(Y) ∏_i P(F_i | Y)
  - The full joint would need |Y| × |F|^n values; the Naive Bayes factorization needs |Y| parameters for P(Y) plus n × |F| × |Y| parameters for the conditionals
- We only have to specify how each feature depends on the class
- Total number of parameters is linear in n
- Model is very simplistic, but often works anyway

Inference for Naïve Bayes
- Goal: compute the posterior distribution over the label variable Y
- Step 1: get the joint probability of the label and the evidence, for each label: P(y, f_1, ..., f_n) = P(y) ∏_i P(f_i | y)
- Step 2: sum over labels to get the probability of the evidence: P(f_1, ..., f_n) = Σ_y P(y, f_1, ..., f_n)
- Step 3: normalize by dividing Step 1 by Step 2

General Naïve Bayes
- What do we need in order to use Naïve Bayes?
  - An inference method (we just saw this part):
    - Start with a bunch of probabilities: P(Y) and the P(F_i | Y) tables
    - Use standard inference to compute P(Y | F_1, ..., F_n)
    - Nothing new here
  - Estimates of the local conditional probability tables:
    - P(Y), the prior over labels
    - P(F_i | Y) for each feature (evidence variable)
    - These probabilities are collectively called the parameters of the model and denoted by θ
    - Up until now, we assumed these appeared by magic, but they typically come from training data counts: we'll look at this soon
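A minimal sketch of the three inference steps above, assuming the parameters P(Y) and P(F_i | Y) are already given as dictionaries; the variable names and the toy numbers are illustrative, not from the slides.

```python
def naive_bayes_posterior(p_label, p_feat, evidence):
    """Return P(Y | f_1 ... f_n) for the observed feature values `evidence`."""
    # Step 1: joint probability of label and evidence, P(y) * prod_i P(f_i | y)
    joint = {}
    for y in p_label:
        prob = p_label[y]
        for i, f in enumerate(evidence):
            prob *= p_feat[y][i][f]
        joint[y] = prob
    # Step 2: probability of the evidence, summed over labels
    z = sum(joint.values())
    # Step 3: normalize
    return {y: joint[y] / z for y in joint}

# Toy model: two binary features and two labels (made-up numbers).
p_label = {"spam": 1/3, "ham": 2/3}
p_feat = {
    "spam": [{True: 0.8, False: 0.2}, {True: 0.1, False: 0.9}],
    "ham":  [{True: 0.3, False: 0.7}, {True: 0.5, False: 0.5}],
}
print(naive_bayes_posterior(p_label, p_feat, [True, False]))
```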

Example: Conditional Probabilities
- [The slide shows example learned tables P(F_ij = on | Y = y) for a few pixel features across the digit classes.]

Naïve Bayes for Text
- Bag-of-words Naïve Bayes:
  - Features: W_i is the word at position i
  - As before: predict the label conditioned on the feature variables (spam vs. ham)
  - As before: assume the features are conditionally independent given the label
  - New: each W_i is identically distributed
  - (This is the word at position i, not the i-th word in the dictionary!)
- Generative model: P(Y, W_1, ..., W_n) = P(Y) ∏_i P(W_i | Y)
- "Tied" distributions and bag-of-words:
  - Usually, each variable gets its own conditional probability distribution P(F | Y)
  - In a bag-of-words model:
    - Each position is identically distributed
    - All positions share the same conditional probabilities P(W | Y)
    - Why make this assumption?
  - Called "bag-of-words" because the model is insensitive to word order or reordering

Example: Spam Filtering
- Model: P(Y, W_1, ..., W_n) = P(Y) ∏_i P(W_i | Y)
- What are the parameters?
  - The prior P(Y): ham 0.66, spam 0.33
  - The word tables P(w | spam) and P(w | ham): the slide lists the most probable words under each class (e.g. "the", "to", "and", "of", "you", "with", "from", "2002") with their probabilities
- Where do these tables come from?

Spam Example
- [The slide scores a short email word by word: each word contributes log P(w | spam) and log P(w | ham), the columns are totaled, and normalizing the totals gives P(spam | words) = 98.9% for this example.]

Training and Testing

Empirical Risk Minimization
- Empirical risk minimization:
  - Basic principle of machine learning
  - We want the model (classifier, etc.) that does best on the true test distribution
  - We don't know the true distribution, so we pick the best model on our actual training set
  - Finding "the best" model on the training set is phrased as an optimization problem
- Main worry: overfitting to the training set
  - Better with more training data (less sampling variance, training more like test)
  - Better if we limit the complexity of our hypotheses (regularization and/or small hypothesis spaces)
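The following sketch shows how a bag-of-words model like the one above could score an email, with a single shared table P(w | class) for every position and the arithmetic done in log space for numerical stability. The tiny word tables and the floor used for unseen words are made-up illustrations, not the slide's actual parameters.

```python
import math

prior = {"ham": 2/3, "spam": 1/3}
p_word = {                       # shared ("tied") P(w | class) used at every position
    "spam": {"free": 0.05, "money": 0.04, "the": 0.02, "meeting": 0.001},
    "ham":  {"free": 0.002, "money": 0.003, "the": 0.03, "meeting": 0.02},
}

def posterior(words):
    """P(class | words) under the bag-of-words Naive Bayes model."""
    log_joint = {}
    for y in prior:
        s = math.log(prior[y])
        for w in words:
            s += math.log(p_word[y].get(w, 1e-6))   # crude floor for unseen words
        log_joint[y] = s
    m = max(log_joint.values())                     # log-sum-exp trick for stability
    z = sum(math.exp(v - m) for v in log_joint.values())
    return {y: math.exp(log_joint[y] - m) / z for y in log_joint}

print(posterior(["free", "money", "the"]))
```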

Important Concepts
- Data: labeled instances (e.g. emails marked spam/ham)
  - Training set
  - Held-out set
  - Test set
- Features: attribute-value pairs which characterize each x
- Experimentation cycle:
  - Learn parameters (e.g. model probabilities) on the training set
  - (Tune hyperparameters on the held-out set)
  - Compute accuracy on the test set
  - Very important: never "peek" at the test set!
- Evaluation (many metrics possible, e.g. accuracy)
  - Accuracy: fraction of instances predicted correctly
- Overfitting and generalization:
  - Want a classifier which does well on test data
  - Overfitting: fitting the training data very closely, but not generalizing well
  - We'll investigate overfitting and generalization formally in a few lectures

Example: Overfitting
- [Plot: a degree-15 polynomial fit swings wildly between the training points, while a low-degree fit tracks the trend. "Degree 15 polynomial" ... "2 wins!!"]

Example: Overfitting
- Posteriors are determined by relative probabilities (odds ratios); on the slide, the words "screens", "minute", "guaranteed", "$205.00", "delivery", and "signature." all come out with odds ratios of inf, because each appears in training emails of only one class.
- What went wrong here?

Generalization and Overfitting
- Relative frequency parameters will overfit the training data!
  - Just because we never saw a 3 with pixel (15,15) on during training doesn't mean we won't see it at test time
  - Unlikely that every occurrence of "minute" is 100% spam
  - Unlikely that every occurrence of "seriously" is 100% ham
  - What about all the words that don't occur in the training set at all?
  - In general, we can't go around giving unseen events zero probability
- As an extreme case, imagine using the entire email as the only feature (e.g. a document ID)
  - Would get the training data perfect (if the labeling is deterministic)
  - Wouldn't generalize at all
- Just making the bag-of-words assumption gives us some generalization, but it isn't enough
- To generalize better: we need to smooth or regularize the estimates
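A small sketch of the experimentation cycle described above: split the labeled data into training, held-out, and test portions, and measure accuracy as the fraction of instances predicted correctly. The split fractions and the toy classifier are arbitrary assumptions for illustration.

```python
import random

def split_data(examples, train_frac=0.8, heldout_frac=0.1, seed=0):
    """Shuffle and split labeled examples into training, held-out, and test sets."""
    data = examples[:]
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(train_frac * n)
    n_held = int(heldout_frac * n)
    return data[:n_train], data[n_train:n_train + n_held], data[n_train + n_held:]

def accuracy(predict, dataset):
    """Fraction of instances predicted correctly."""
    return sum(predict(x) == y for x, y in dataset) / len(dataset)

# Toy usage: learn nothing, just evaluate a hand-written rule on the test split.
toy = [("FREE MONEY", "spam"), ("lunch?", "ham"), ("WIN NOW", "spam"), ("report due", "ham")] * 5
train, held_out, test = split_data(toy)
print(accuracy(lambda x: "spam" if x.isupper() else "ham", test))
```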

Parameter Estimation
- Estimating the distribution of a random variable
- Elicitation: ask a human (why is this hard?)
- Empirically: use training data (learning!)
  - E.g.: for each outcome x, look at the empirical rate of that value:
        P_ML(x) = count(x) / total number of samples
  - This is the estimate that maximizes the likelihood of the data
- [Figure: repeated draws of red (r) and blue (b) balls used as the running example.]

Smoothing

Maximum Likelihood?
- Relative frequencies are the maximum likelihood estimates
- Another option is to consider the most likely parameter value given the data (the MAP estimate)

Unseen Events

Laplace Smoothing
- Laplace's estimate: pretend you saw every outcome once more than you actually did:
      P_LAP(x) = (count(x) + 1) / (N + |X|)
  where N is the number of samples and |X| is the number of possible outcomes
- Can derive this estimate with Dirichlet priors (see cs281a)
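The two estimators above are easy to state in code; here is a minimal sketch with illustrative draws (two red, one blue), not tied to the slide's exact figure.

```python
from collections import Counter

def max_likelihood(samples):
    """P_ML(x) = count(x) / N: relative frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return {x: c / n for x, c in counts.items()}

def laplace(samples, outcomes):
    """P_LAP(x) = (count(x) + 1) / (N + |X|): every outcome seen once more."""
    counts = Counter(samples)
    n = len(samples)
    return {x: (counts[x] + 1) / (n + len(outcomes)) for x in outcomes}

draws = ["r", "r", "b"]                     # illustrative draws: two red, one blue
print(max_likelihood(draws))                # r: 2/3, b: 1/3
print(laplace(draws, ["r", "b"]))           # r: 3/5, b: 2/5
```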

Laplace Smoothing
- Laplace's estimate (extended): pretend you saw every outcome k extra times:
      P_LAP,k(x) = (count(x) + k) / (N + k|X|)
  - What's Laplace with k = 0?
  - k is the strength of the prior
- Laplace for conditionals: smooth each condition independently:
      P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|)

Estimation: Linear Interpolation*
- In practice, Laplace often performs poorly for P(X | Y):
  - When |X| is very large
  - When |Y| is very large
- Another option: linear interpolation
  - Also get the empirical P(X) from the data
  - Make sure the estimate of P(X | Y) isn't too different from the empirical P(X):
      P_LIN(x | y) = α · P_empirical(x | y) + (1 - α) · P_empirical(x)
  - What if α is 0? 1?
- For even better ways to estimate parameters, as well as details of the math, see cs281a, cs288

Real NB: Smoothing
- For real classification problems, smoothing is critical
- New odds ratios (one column on the slide for each direction of the ratio):
  - helvetica: 11.4, seems: 10.8, group: 10.2, ago: 8.4, areas: 8.3
  - verdana: 28.8, Credit: 28.4, ORDER: 27.2, FONT: 26.9, money: 26.5
- Do these make more sense?

Tuning

Tuning on Held-Out Data
- Now we've got two kinds of unknowns:
  - Parameters: the probabilities P(X | Y) and P(Y)
  - Hyperparameters: e.g. the amount / type of smoothing to do, k, α
- What should we learn where?
  - Learn parameters from the training data
  - Tune hyperparameters on different data (why?)
  - For each value of the hyperparameters, train and test on the held-out data
  - Choose the best value and do a final test on the test data
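To make the held-out tuning loop concrete, here is a minimal sketch that tries several smoothing strengths k, fits Laplace-k estimates from training counts, and keeps the k that scores best on held-out data. Using held-out log-likelihood as the score, and the particular candidate values of k, are assumptions for illustration; held-out accuracy could be tuned the same way.

```python
import math
from collections import Counter

def laplace_k(counts, total, outcomes, k):
    """P_LAP,k(x) = (count(x) + k) / (N + k|X|)."""
    return {x: (counts[x] + k) / (total + k * len(outcomes)) for x in outcomes}

def heldout_log_likelihood(dist, heldout_samples):
    """Log probability the estimated distribution assigns to the held-out samples."""
    return sum(math.log(dist[x]) for x in heldout_samples)

train = ["r", "r", "b"]          # illustrative training draws
held_out = ["b", "r", "b"]       # illustrative held-out draws
outcomes = ["r", "b"]
counts, total = Counter(train), len(train)

best_k = max([0.001, 0.1, 1.0, 10.0],
             key=lambda k: heldout_log_likelihood(laplace_k(counts, total, outcomes, k), held_out))
print(best_k)                    # the k that generalizes best to the held-out draws
```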

Features

Errors, and What to Do
- Examples of errors:
  - "Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the . . ."
  - ". . . To receive your $30 Amazon.com promotional certificate, click through to http://www.amazon.com/apparel and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future e-mails announcing new store launches, please click . . ."

What to Do About Errors?
- Need more features - words aren't enough!
  - Have you emailed the sender before?
  - Have 1K other people just gotten the same email?
  - Is the sending information consistent?
  - Is the email in ALL CAPS?
  - Do inline URLs point where they say they point?
  - Does the email address you by (your) name?
- Can add these information sources as new variables in the NB model
- Next class we'll talk about classifiers which let you add arbitrary features more easily, and, later, how to induce new features

Baselines
- First step: get a baseline
  - Baselines are very simple "straw man" procedures
  - Help determine how hard the task is
  - Help know what a "good" accuracy is
- Weak baseline: most frequent label classifier
  - Gives all test instances whatever label was most common in the training set
  - E.g. for spam filtering, might label everything as ham
  - Accuracy might be very high if the problem is skewed
  - E.g. calling everything "ham" gets 66%, so a classifier that gets 70% isn't very good
- For real research, usually use previous work as a (strong) baseline

Confidences from a Classifier
- The confidence of a probabilistic classifier:
  - Posterior probability of the top label
  - Represents how sure the classifier is of the classification
  - Any probabilistic model will have confidences
  - No guarantee the confidence is correct
- Calibration
  - Weak calibration: higher confidences mean higher accuracy
  - Strong calibration: confidence predicts the accuracy rate
  - What's the value of calibration?

Summary
- Bayes' rule lets us do diagnostic queries with causal probabilities
- The naïve Bayes assumption takes all features to be independent given the class label
- We can build classifiers out of a naïve Bayes model using training data
- Smoothing estimates is important in real systems
- Classifier confidences are useful, when you can get them

Next Time: Discriminative Learning
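To close, two of the ideas above, the most-frequent-label baseline and classifier confidence, fit in a few lines; the sketch below is illustrative rather than any specific implementation from the course.

```python
from collections import Counter

def most_frequent_label_baseline(train_labels):
    """Weak baseline: return a classifier that ignores its input entirely."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda x: majority

def confidence(posterior):
    """Top label and the posterior probability the model assigns to it."""
    label = max(posterior, key=posterior.get)
    return label, posterior[label]

baseline = most_frequent_label_baseline(["ham", "ham", "spam"])
print(baseline("FREE MONEY"))                     # 'ham' regardless of input
print(confidence({"spam": 0.989, "ham": 0.011}))  # ('spam', 0.989)
```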

