Classification Example: Spam Filter - University Of California, Berkeley


CS 188: Artificial Intelligence - Naïve Bayes
Instructors: Dan Klein and Pieter Abbeel, University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Machine Learning
- Up until now: how to use a model to make optimal decisions
- Machine learning: how to acquire a model from data / experience
  - Learning parameters (e.g. probabilities)
  - Learning structure (e.g. BN graphs)
  - Learning hidden concepts (e.g. clustering, neural nets)
- Today: model-based classification with Naive Bayes

Classification
- Classification: given inputs x, predict labels (classes) y

Example: Spam Filter
- Input: an email
- Output: spam / ham
- Setup:
  - Get a large collection of example emails, each labeled "spam" or "ham"
  - Note: someone has to hand label all this data!
  - Want to learn to predict labels of new, future emails
- Features: the attributes used to make the ham / spam decision
  - Words: FREE!
  - Text patterns: $dd, CAPS
  - Non-text: SenderInContacts, WidelyBroadcast
- Example emails shown on the slide:
  - "Dear Sir. First, I must solicit your confidence in this transaction, this is by virtue of its nature as being utterly confidential and top secret. ..."
  - "TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99"
  - "Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened."

Example: Digit Recognition
- Input: images / pixel grids
- Output: a digit 0-9
- Setup:
  - Get a large collection of example images, each labeled with a digit
  - Note: someone has to hand label all this data!
  - Want to learn to predict labels of new, future digit images
- Examples: [images of handwritten digits such as 0, 1, 2]
- Features: the attributes used to make the digit decision
  - Pixels: (6,8) ON
  - Shape patterns: NumComponents, AspectRatio, NumLoops
  - Features are increasingly induced rather than crafted

Other Classification Tasks
- Medical diagnosis (input: symptoms, classes: diseases)
- Fraud detection (input: account activity, classes: fraud / no fraud)
- Automatic essay grading (input: document, classes: grades)
- Customer service email routing
- Review sentiment
- Language ID
- ... many more
- Classification is an important commercial technology!
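As a concrete (and entirely illustrative) companion to the feature lists above, the sketch below shows one way such ham/spam features could be computed from a raw email in Python. The feature names, the regular expressions, and the contacts argument are assumptions for illustration, not part of the slides.

```python
# A minimal sketch (not from the slides) of turning a raw email into the kind of
# binary features listed above: word presence, text patterns, non-text signals.
import re

def extract_features(email_text, sender, contacts):
    """Map one email to a dict of binary features."""
    words = re.findall(r"[A-Za-z']+", email_text.lower())
    features = {}
    for w in set(words):                                   # word-presence features, e.g. "free"
        features["word:" + w] = True
    features["pattern:all_caps"] = email_text.isupper()    # text-pattern feature
    features["pattern:dollar_amount"] = bool(re.search(r"\$\d", email_text))
    features["sender_in_contacts"] = sender in contacts    # non-text feature
    return features

print(extract_features("WIN A FREE PRIZE NOW", "spammer@example.com", {"alice@example.com"}))
```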

Model-Based Classification
- Model-based approach:
  - Build a model (e.g. a Bayes' net) where both the output label and the input features are random variables
  - Instantiate any observed features
  - Query for the distribution of the label conditioned on the features
- Challenges:
  - What structure should the BN have?
  - How should we learn its parameters?

Naïve Bayes for Digits
- Naïve Bayes: assume all features are independent effects of the label
- Simple digit recognition version:
  - One feature (variable) F_ij for each grid position <i,j>
  - Feature values are on / off, based on whether the intensity is more or less than 0.5 in the underlying image
  - Each input maps to a feature vector of on/off pixel values
  - Here: lots of features, each is binary valued
- Naïve Bayes model: P(Y | F_0,0, ..., F_15,15) ∝ P(Y) ∏_ij P(F_ij | Y)
- What do we need to learn?

General Naïve Bayes
- A general Naive Bayes model with label Y and features F_1, ..., F_n:
      P(Y, F_1, ..., F_n) = P(Y) ∏_i P(F_i | Y)
  - The full joint would need |Y| × |F|^n values; the Naive Bayes factorization needs |Y| parameters for P(Y) plus n × |F| × |Y| parameters for the conditionals
- We only have to specify how each feature depends on the class
- Total number of parameters is linear in n
- Model is very simplistic, but often works anyway

Inference for Naïve Bayes
- Goal: compute the posterior distribution over the label variable Y
- Step 1: get the joint probability of the label and the evidence, for each label: P(y, f_1, ..., f_n) = P(y) ∏_i P(f_i | y)
- Step 2: sum over labels to get the probability of the evidence: P(f_1, ..., f_n) = Σ_y P(y, f_1, ..., f_n)
- Step 3: normalize by dividing Step 1 by Step 2

General Naïve Bayes
- What do we need in order to use Naïve Bayes?
  - An inference method (we just saw this part):
    - Start with a bunch of probabilities: P(Y) and the P(F_i | Y) tables
    - Use standard inference to compute P(Y | F_1, ..., F_n)
    - Nothing new here
  - Estimates of the local conditional probability tables:
    - P(Y), the prior over labels
    - P(F_i | Y) for each feature (evidence variable)
    - These probabilities are collectively called the parameters of the model and denoted by θ
    - Up until now, we assumed these appeared by magic, but they typically come from training data counts: we'll look at this soon
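A minimal sketch of the three inference steps above, assuming the parameters P(Y) and P(F_i | Y) are already given as dictionaries; the variable names and the toy numbers are illustrative, not from the slides.

```python
def naive_bayes_posterior(p_label, p_feat, evidence):
    """Return P(Y | f_1 ... f_n) for the observed feature values `evidence`."""
    # Step 1: joint probability of label and evidence, P(y) * prod_i P(f_i | y)
    joint = {}
    for y in p_label:
        prob = p_label[y]
        for i, f in enumerate(evidence):
            prob *= p_feat[y][i][f]
        joint[y] = prob
    # Step 2: probability of the evidence, summed over labels
    z = sum(joint.values())
    # Step 3: normalize
    return {y: joint[y] / z for y in joint}

# Toy model: two binary features and two labels (made-up numbers).
p_label = {"spam": 1/3, "ham": 2/3}
p_feat = {
    "spam": [{True: 0.8, False: 0.2}, {True: 0.1, False: 0.9}],
    "ham":  [{True: 0.3, False: 0.7}, {True: 0.5, False: 0.5}],
}
print(naive_bayes_posterior(p_label, p_feat, [True, False]))
```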

Example: Conditional Probabilities
- [The slide shows example learned tables P(F_ij = on | Y = y) for a few pixel features across the digit classes.]

Naïve Bayes for Text
- Bag-of-words Naïve Bayes:
  - Features: W_i is the word at position i
  - As before: predict the label conditioned on the feature variables (spam vs. ham)
  - As before: assume the features are conditionally independent given the label
  - New: each W_i is identically distributed
  - (This is the word at position i, not the i-th word in the dictionary!)
- Generative model: P(Y, W_1, ..., W_n) = P(Y) ∏_i P(W_i | Y)
- "Tied" distributions and bag-of-words:
  - Usually, each variable gets its own conditional probability distribution P(F | Y)
  - In a bag-of-words model:
    - Each position is identically distributed
    - All positions share the same conditional probabilities P(W | Y)
    - Why make this assumption?
  - Called "bag-of-words" because the model is insensitive to word order or reordering

Example: Spam Filtering
- Model: P(Y, W_1, ..., W_n) = P(Y) ∏_i P(W_i | Y)
- What are the parameters?
  - The prior P(Y): ham 0.66, spam 0.33
  - The word tables P(w | spam) and P(w | ham): the slide lists the most probable words under each class (e.g. "the", "to", "and", "of", "you", "with", "from", "2002") with their probabilities
- Where do these tables come from?

Spam Example
- [The slide scores a short email word by word: each word contributes log P(w | spam) and log P(w | ham), the columns are totaled, and normalizing the totals gives P(spam | words) = 98.9% for this example.]

Training and Testing

Empirical Risk Minimization
- Empirical risk minimization:
  - Basic principle of machine learning
  - We want the model (classifier, etc.) that does best on the true test distribution
  - We don't know the true distribution, so we pick the best model on our actual training set
  - Finding "the best" model on the training set is phrased as an optimization problem
- Main worry: overfitting to the training set
  - Better with more training data (less sampling variance, training more like test)
  - Better if we limit the complexity of our hypotheses (regularization and/or small hypothesis spaces)
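The following sketch shows how a bag-of-words model like the one above could score an email, with a single shared table P(w | class) for every position and the arithmetic done in log space for numerical stability. The tiny word tables and the floor used for unseen words are made-up illustrations, not the slide's actual parameters.

```python
import math

prior = {"ham": 2/3, "spam": 1/3}
p_word = {                       # shared ("tied") P(w | class) used at every position
    "spam": {"free": 0.05, "money": 0.04, "the": 0.02, "meeting": 0.001},
    "ham":  {"free": 0.002, "money": 0.003, "the": 0.03, "meeting": 0.02},
}

def posterior(words):
    """P(class | words) under the bag-of-words Naive Bayes model."""
    log_joint = {}
    for y in prior:
        s = math.log(prior[y])
        for w in words:
            s += math.log(p_word[y].get(w, 1e-6))   # crude floor for unseen words
        log_joint[y] = s
    m = max(log_joint.values())                     # log-sum-exp trick for stability
    z = sum(math.exp(v - m) for v in log_joint.values())
    return {y: math.exp(log_joint[y] - m) / z for y in log_joint}

print(posterior(["free", "money", "the"]))
```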

Important Concepts
- Data: labeled instances (e.g. emails marked spam/ham)
  - Training set
  - Held-out set
  - Test set
- Features: attribute-value pairs which characterize each x
- Experimentation cycle:
  - Learn parameters (e.g. model probabilities) on the training set
  - (Tune hyperparameters on the held-out set)
  - Compute accuracy on the test set
  - Very important: never "peek" at the test set!
- Evaluation (many metrics possible, e.g. accuracy)
  - Accuracy: fraction of instances predicted correctly
- Overfitting and generalization:
  - Want a classifier which does well on test data
  - Overfitting: fitting the training data very closely, but not generalizing well
  - We'll investigate overfitting and generalization formally in a few lectures

Example: Overfitting
- [Plot: a degree-15 polynomial fit swings wildly between the training points, while a low-degree fit tracks the trend. "Degree 15 polynomial" ... "2 wins!!"]

Example: Overfitting
- Posteriors are determined by relative probabilities (odds ratios); on the slide, the words "screens", "minute", "guaranteed", "$205.00", "delivery", and "signature." all come out with odds ratios of inf, because each appears in training emails of only one class.
- What went wrong here?

Generalization and Overfitting
- Relative frequency parameters will overfit the training data!
  - Just because we never saw a 3 with pixel (15,15) on during training doesn't mean we won't see it at test time
  - Unlikely that every occurrence of "minute" is 100% spam
  - Unlikely that every occurrence of "seriously" is 100% ham
  - What about all the words that don't occur in the training set at all?
  - In general, we can't go around giving unseen events zero probability
- As an extreme case, imagine using the entire email as the only feature (e.g. a document ID)
  - Would get the training data perfect (if the labeling is deterministic)
  - Wouldn't generalize at all
- Just making the bag-of-words assumption gives us some generalization, but it isn't enough
- To generalize better: we need to smooth or regularize the estimates
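A small sketch of the experimentation cycle described above: split the labeled data into training, held-out, and test portions, and measure accuracy as the fraction of instances predicted correctly. The split fractions and the toy classifier are arbitrary assumptions for illustration.

```python
import random

def split_data(examples, train_frac=0.8, heldout_frac=0.1, seed=0):
    """Shuffle and split labeled examples into training, held-out, and test sets."""
    data = examples[:]
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(train_frac * n)
    n_held = int(heldout_frac * n)
    return data[:n_train], data[n_train:n_train + n_held], data[n_train + n_held:]

def accuracy(predict, dataset):
    """Fraction of instances predicted correctly."""
    return sum(predict(x) == y for x, y in dataset) / len(dataset)

# Toy usage: learn nothing, just evaluate a hand-written rule on the test split.
toy = [("FREE MONEY", "spam"), ("lunch?", "ham"), ("WIN NOW", "spam"), ("report due", "ham")] * 5
train, held_out, test = split_data(toy)
print(accuracy(lambda x: "spam" if x.isupper() else "ham", test))
```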

Parameter Estimation
- Estimating the distribution of a random variable
- Elicitation: ask a human (why is this hard?)
- Empirically: use training data (learning!)
  - E.g.: for each outcome x, look at the empirical rate of that value:
        P_ML(x) = count(x) / total number of samples
  - This is the estimate that maximizes the likelihood of the data
- [Figure: repeated draws of red (r) and blue (b) balls used as the running example.]

Smoothing

Maximum Likelihood?
- Relative frequencies are the maximum likelihood estimates
- Another option is to consider the most likely parameter value given the data (the MAP estimate)

Unseen Events

Laplace Smoothing
- Laplace's estimate: pretend you saw every outcome once more than you actually did:
      P_LAP(x) = (count(x) + 1) / (N + |X|)
  where N is the number of samples and |X| is the number of possible outcomes
- Can derive this estimate with Dirichlet priors (see cs281a)
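The two estimators above are easy to state in code; here is a minimal sketch with illustrative draws (two red, one blue), not tied to the slide's exact figure.

```python
from collections import Counter

def max_likelihood(samples):
    """P_ML(x) = count(x) / N: relative frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return {x: c / n for x, c in counts.items()}

def laplace(samples, outcomes):
    """P_LAP(x) = (count(x) + 1) / (N + |X|): every outcome seen once more."""
    counts = Counter(samples)
    n = len(samples)
    return {x: (counts[x] + 1) / (n + len(outcomes)) for x in outcomes}

draws = ["r", "r", "b"]                     # illustrative draws: two red, one blue
print(max_likelihood(draws))                # r: 2/3, b: 1/3
print(laplace(draws, ["r", "b"]))           # r: 3/5, b: 2/5
```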

Laplace Smoothing
- Laplace's estimate (extended): pretend you saw every outcome k extra times:
      P_LAP,k(x) = (count(x) + k) / (N + k|X|)
  - What's Laplace with k = 0?
  - k is the strength of the prior
- Laplace for conditionals: smooth each condition independently:
      P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|)

Estimation: Linear Interpolation*
- In practice, Laplace often performs poorly for P(X | Y):
  - When |X| is very large
  - When |Y| is very large
- Another option: linear interpolation
  - Also get the empirical P(X) from the data
  - Make sure the estimate of P(X | Y) isn't too different from the empirical P(X):
      P_LIN(x | y) = α · P_empirical(x | y) + (1 - α) · P_empirical(x)
  - What if α is 0? 1?
- For even better ways to estimate parameters, as well as details of the math, see cs281a, cs288

Real NB: Smoothing
- For real classification problems, smoothing is critical
- New odds ratios (one column on the slide for each direction of the ratio):
  - helvetica: 11.4, seems: 10.8, group: 10.2, ago: 8.4, areas: 8.3
  - verdana: 28.8, Credit: 28.4, ORDER: 27.2, FONT: 26.9, money: 26.5
- Do these make more sense?

Tuning

Tuning on Held-Out Data
- Now we've got two kinds of unknowns:
  - Parameters: the probabilities P(X | Y) and P(Y)
  - Hyperparameters: e.g. the amount / type of smoothing to do, k, α
- What should we learn where?
  - Learn parameters from the training data
  - Tune hyperparameters on different data (why?)
  - For each value of the hyperparameters, train and test on the held-out data
  - Choose the best value and do a final test on the test data
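To make the held-out tuning loop concrete, here is a minimal sketch that tries several smoothing strengths k, fits Laplace-k estimates from training counts, and keeps the k that scores best on held-out data. Using held-out log-likelihood as the score, and the particular candidate values of k, are assumptions for illustration; held-out accuracy could be tuned the same way.

```python
import math
from collections import Counter

def laplace_k(counts, total, outcomes, k):
    """P_LAP,k(x) = (count(x) + k) / (N + k|X|)."""
    return {x: (counts[x] + k) / (total + k * len(outcomes)) for x in outcomes}

def heldout_log_likelihood(dist, heldout_samples):
    """Log probability the estimated distribution assigns to the held-out samples."""
    return sum(math.log(dist[x]) for x in heldout_samples)

train = ["r", "r", "b"]          # illustrative training draws
held_out = ["b", "r", "b"]       # illustrative held-out draws
outcomes = ["r", "b"]
counts, total = Counter(train), len(train)

best_k = max([0.001, 0.1, 1.0, 10.0],
             key=lambda k: heldout_log_likelihood(laplace_k(counts, total, outcomes, k), held_out))
print(best_k)                    # the k that generalizes best to the held-out draws
```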

Features

Errors, and What to Do
- Examples of errors:
  - "Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the . . ."
  - ". . . To receive your $30 Amazon.com promotional certificate, click through to http://www.amazon.com/apparel and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future e-mails announcing new store launches, please click . . ."

What to Do About Errors?
- Need more features - words aren't enough!
  - Have you emailed the sender before?
  - Have 1K other people just gotten the same email?
  - Is the sending information consistent?
  - Is the email in ALL CAPS?
  - Do inline URLs point where they say they point?
  - Does the email address you by (your) name?
- Can add these information sources as new variables in the NB model
- Next class we'll talk about classifiers which let you add arbitrary features more easily, and, later, how to induce new features

Baselines
- First step: get a baseline
  - Baselines are very simple "straw man" procedures
  - Help determine how hard the task is
  - Help know what a "good" accuracy is
- Weak baseline: most frequent label classifier
  - Gives all test instances whatever label was most common in the training set
  - E.g. for spam filtering, might label everything as ham
  - Accuracy might be very high if the problem is skewed
  - E.g. calling everything "ham" gets 66%, so a classifier that gets 70% isn't very good
- For real research, usually use previous work as a (strong) baseline

Confidences from a Classifier
- The confidence of a probabilistic classifier:
  - Posterior probability of the top label
  - Represents how sure the classifier is of the classification
  - Any probabilistic model will have confidences
  - No guarantee the confidence is correct
- Calibration
  - Weak calibration: higher confidences mean higher accuracy
  - Strong calibration: confidence predicts the accuracy rate
  - What's the value of calibration?

Summary
- Bayes' rule lets us do diagnostic queries with causal probabilities
- The naïve Bayes assumption takes all features to be independent given the class label
- We can build classifiers out of a naïve Bayes model using training data
- Smoothing estimates is important in real systems
- Classifier confidences are useful, when you can get them

Next Time: Discriminative Learning
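To close, two of the ideas above, the most-frequent-label baseline and classifier confidence, fit in a few lines; the sketch below is illustrative rather than any specific implementation from the course.

```python
from collections import Counter

def most_frequent_label_baseline(train_labels):
    """Weak baseline: return a classifier that ignores its input entirely."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda x: majority

def confidence(posterior):
    """Top label and the posterior probability the model assigns to it."""
    label = max(posterior, key=posterior.get)
    return label, posterior[label]

baseline = most_frequent_label_baseline(["ham", "ham", "spam"])
print(baseline("FREE MONEY"))                     # 'ham' regardless of input
print(confidence({"spam": 0.989, "ham": 0.011}))  # ('spam', 0.989)
```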

