Email Filtering Using Bayesian Method - IJSER


International Journal of Scientific & Engineering Research, Volume 6, Issue 1, January-2015
ISSN 2229-5518

Email Filtering Using Bayesian Method
Nadia Al-Bakri

Abstract: Electronic mail is inarguably the most widely used Internet technology today. With the massive amount of information and speed the Internet is able to handle, communication has been revolutionized with email and other online communication systems. However, some computer users have abused the technology used to drive these communications, by sending out thousands and thousands of spam emails with little or no purpose other than to increase traffic or decrease bandwidth.

This paper evaluates the effectiveness of email filtering based on the Bayesian method to automatically construct anti-spam filters with superior performance. A Bayesian e-mail classifier is trained automatically to detect spam messages. A test is performed on a large collection of personal e-mails taken from an email server using the POP3 protocol. The results show that using the Bayesian method in the filtering process improves filter performance.

Keywords: Bayesian, Spam, Ham, Filter, E-mail, Naïve, POP3.

1-INTRODUCTION
Email spam, also known as unsolicited bulk email (UBE), junk mail, or unsolicited commercial email (UCE), is the practice of sending unwanted email messages, frequently with commercial content, in large quantities to an indiscriminate set of recipients. Spam in email started to become a problem when the Internet was opened up to the general public in the mid-1990s. It grew exponentially over the following years, and today composes some 80 to 85% of all the email in the world, by a conservative estimate [1]. Pressure to make email spam illegal has been successful in some jurisdictions, but less so in others.
Spammers take advantage of this fact, and frequently outsource parts of their operations to countries where spamming will not get them into legal trouble.

Attempts to introduce legal measures against spam mailing have had limited effect [2]. A more effective solution is to develop tools that help recipients identify or automatically remove spam messages. Such tools, called anti-spam filters, vary in functionality from blacklists of frequent spammers to content-based filters. The latter are generally more powerful, as spammers often use fake addresses. Existing content-based filters search for particular keyword patterns in the messages. These patterns need to be crafted by hand, and to achieve better results they need to be tuned to each user and constantly maintained, a tedious task requiring expertise that a user may not have [3].

The issue of anti-spam filtering was addressed with the aid of machine learning. A supervised learning method was examined, which learns to identify spam e-mail after being trained on messages that have been manually classified as spam or non-spam.

2-Types of Spam Filters
Spam filters use a combination of techniques to sort through messages and separate the genuine messages from the junk mail. These techniques rely on the following measures [4]:
- Word lists: Lists of words that are known to be associated with spam and are commonly found in unsolicited mail messages, such as 'sex' or 'mortgage'.
- Blacklists and whitelists: These lists contain known IP addresses of spam senders (blacklists) and of non-spam senders such as friends and family (whitelists). Addresses that form part of the contact list are automatically registered on the whitelist, and any email originating from these addresses is sent directly to the inbox.
- Trend analysis: By analyzing the history of email sent from an individual, trends can help assess the likelihood of an email being genuine or spam.
- Learning or content filters: Learning filters, such as Bayesian filters, examine the content of each email sent to and from an email address. By learning the word frequencies and patterns associated with both spam and non-spam messages, they are able to recognize which messages are valid and should be directed to the inbox, and which are spam and should be sent to Junk.

Author affiliation: Assistant lecturer, Computer Science Department, Al-Nahrain University, Baghdad, Iraq. Email: nadiaf 1966@yahoo.com
IJSER 2015 http://www.ijser.org

3-Classification of e-mail messages
We now turn to the learning algorithm we experimented with.

3.1 Naive Bayesian classification
Bayesian filtering is a statistical technique for e-mail filtering. In its basic form, it makes use of a naive Bayes classifier on bag-of-words features to identify spam e-mail, an approach commonly used in text classification. Bayesian filtering is based on the principle that most events are dependent and that the probability of an event occurring in the future can be inferred from the previous occurrences of that event. The same technique can be used to classify spam: if some piece of text occurs often in spam but not in legitimate email, then it is reasonable to assume that an email containing it is probably spam. Naive Bayes classifiers work by correlating the use of tokens (typically words, or sometimes other things) with spam and non-spam emails and then using Bayesian inference to calculate the probability that an email is or is not spam. It is one of the oldest ways of doing spam filtering, with roots in the 1990s [2].

From Bayes' theorem and the theorem of total probability, the probability that a document d with vector $\vec{x} = \langle x_1, \ldots, x_n \rangle$ belongs to category c is [5]:

$$P(C = c \mid \vec{X} = \vec{x}) = \frac{P(C = c)\, P(\vec{X} = \vec{x} \mid C = c)}{\sum_{k \in \{\text{spam},\, \text{ham}\}} P(C = k)\, P(\vec{X} = \vec{x} \mid C = k)}$$

In practice, the probabilities $P(\vec{X} \mid C)$ are impossible to estimate without simplifying assumptions, because the possible values of $\vec{X}$ are too many and there are also data sparseness problems. The Naive Bayesian classifier assumes that $X_1, \ldots, X_n$ are conditionally independent given the category C, which yields:

$$P(C = c \mid \vec{X} = \vec{x}) = \frac{P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c)}{\sum_{k \in \{\text{spam},\, \text{ham}\}} P(C = k) \prod_{i=1}^{n} P(X_i = x_i \mid C = k)}$$

$P(X_i \mid C)$ and $P(C)$ are easy to estimate from the frequencies of the training corpus.

A message is classified as spam if the following criterion is met:

$$\frac{P(C = \text{spam} \mid \vec{X} = \vec{x})}{P(C = \text{ham} \mid \vec{X} = \vec{x})} > \lambda$$

To the extent that the independence assumption holds and the probability estimates are accurate, a classifier based on this criterion achieves optimal results [6]. In our case, $P(C = \text{spam} \mid \vec{X} = \vec{x}) = 1 - P(C = \text{ham} \mid \vec{X} = \vec{x})$, and the classification criterion is equivalent to:

$$P(C = \text{spam} \mid \vec{X} = \vec{x}) > t, \qquad t = \frac{\lambda}{1 + \lambda}$$

where t is the threshold value and λ is the factor by which misclassifying a legitimate message as spam is considered more costly than letting a spam message pass.

4-Email Server Connection
4.1 POP3 (Post Office Protocol, version 3)
In computing, the Post Office Protocol (POP) is an application-layer Internet standard protocol used by local e-mail clients to retrieve e-mail from a remote server over a TCP/IP connection [7]. POP supports simple download-and-delete requirements for access to remote mailboxes. Although most POP clients have an option to leave mail on the server after download, e-mail clients using POP generally connect, retrieve all messages, store them on the user's PC as new messages, delete them from the server, and then disconnect. Many e-mail clients support POP to retrieve messages.

5- The Proposed method Design
The design of the proposed method for filtering spam e-mail messages is discussed below in phases.

5.1 Training the proposed E-mail filter
Before email can be filtered using this method, the user needs to generate a database of words. The following steps show the training process.

A- Connect to the database (Microsoft Access)
In this step, a connection to the database is opened by specifying the provider for the type of database and the source (location) of the database.

B- Create database of spam and ham words
Microsoft Office Access has been used to create two tables. The first table contains two fields: the spam words collected from a sample of spam email (a message is recognized as spam because of certain key words, such as "Viagra" and "mortgage") and their occurrences in spam messages. The second table contains the ham words and their occurrences in ham messages. The records of each table hold a list of words, as illustrated in Table 1 and Table 2.

Table 1: list of some spam words
Table 2: list of some ham words
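As a rough illustration of the training step in section 5.1, the two word tables can be sketched in Python. This is a minimal sketch under assumptions that are not in the paper: in-memory `Counter` objects stand in for the Microsoft Access tables, and the names `tokenize`, `train`, and `TOKEN_RE` (with its token pattern) are hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical token pattern: runs of letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Split a message into lowercase single-word tokens."""
    return TOKEN_RE.findall(text.lower())


def train(messages):
    """Build a word -> occurrence-count table from training messages."""
    table = Counter()
    for message in messages:
        # Every occurrence is counted, so a word's count can exceed
        # the number of training messages.
        table.update(tokenize(message))
    return table


# One table per class, mirroring the spam-word and ham-word tables.
spam_table = train(["Best quality drugs", "Best prices in the market"])
ham_table = train(["Important meeting today at noon."])

print(spam_table["best"])   # 2
print(ham_table["today"])   # 1
print(spam_table["today"])  # 0
```

Counting every occurrence, rather than one per message, follows the paper's description of storing each word together with its occurrences; a count per message would be an equally plausible reading.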

Table 2 (continued): ham words and their occurrences

No.   Ham       Occurrences
4     today     130
5     yahoo     20
6     hello     79
7     come      23
8     tonight   10

5.2 Connect to the server
A connection to the server is made using POP3 by specifying the server name, port, and security mode. The server used is Yahoo, and the port is 995.

Receive emails using POP3 code (parts of the listing are lost in the source and are marked [...]):

    Using pop3 As New [...]ser", "password")
        ' Receive all messages and display the subject
        Dim builder As New MailBuilder()
        For Each uid As String In pop3.GetAll()
            Dim email As IMail [...]ail.Text)
        Next
        pop3.Close()
    End Using

5.3 Parse words of current message
The proposed filter splits the message into tokens and builds a table of all the tokens it intends to use in the decision-making process. Tokens are taken from the body and subject of the email. The filter uses single words in the calculations to decide whether a message should be classified as spam or not. For each message retrieved from the server, each word is extracted separately using a regular expression (Regex).

5.4 Elimination of stop words
After initial indexing, the document index is found to contain useless terms. To decrease the number of terms in the index, it is filtered by removing stop words. A list of 1500 stop words is suggested, including ordinary stop words such as "the", "which", "is", and numbers, as well as extracted or suggested stop words such as "repeat", "high", "width", "second", and "first".

Examples of training the filter on these short spam messages:
Best quality drugs
Worldwide shipping
USPS - Fast Delivery Shipping 1-4 day USA
Professional packaging
100% guarantee on delivery
Best prices in the market

Examples of training the filter on these short ham messages:
Important meeting today at noon.
When is the next time you're coming home to visit?
Let's all meet at the diner for breakfast.

5.5 Calculate the number of iterated words
1- Find the words of the current email message that match the spam and ham words found in the database.
2- Find how many times each word of the current message is repeated among both the spam and the ham words.

5.6 Bayesian Filter Process
In this step the Bayesian filter is applied. For each repeated word (in spam and in the current message), the number of occurrences is divided by the total number of messages.

The formula used by the proposed method, which is derived from Bayes' theorem, is:

$$\Pr(S \mid W) = \frac{\Pr(W \mid S)\,\Pr(S)}{\Pr(W \mid S)\,\Pr(S) + \Pr(W \mid H)\,\Pr(H)}$$

where:
- Pr(S|W) is the probability that a message is spam, given that it contains the word W
- Pr(S) is the overall probability that any given message is spam
- Pr(W|S) is the probability that the word appears in spam messages
- Pr(H) is the overall probability that any given message is not spam (is ham)
- Pr(W|H) is the probability that the word appears in ham messages

5.7 Calculate the Spamicity
The email filter calculates each word's spamicity and the probability that the message is spam, as shown in the following pseudo code:

Create table 3 in the database with 4 fields: the word of the current message, its spam probability, its ham probability, and its spamicity value.
For each word in current message
    Store word in table 3.
    If word is stop word then Read next word.
    Read the frequency of the word in spam table.
    Read the frequency of the word in ham table.

    If word frequency < 2 then Spamicity = 0.4
    Count the number of spam messages the filter has been trained on.
    Count the number of ham messages the filter has been trained on.
    Ham probability = frequency of the word in ham table / number of ham messages trained on.
    Spam probability = frequency of the word in spam table / number of spam messages trained on.
    If Ham probability > 1 then Ham probability = 1
    If Spam probability > 1 then Spam probability = 1
    Word Spamicity = Spam probability / (Ham probability + Spam probability)
    Store word Spamicity in table 3.
Until end of words in current message.
Choose 30 words from table 3.
Spam probability of current message = (product of the 30 word spamicities) / (product of the 30 word spamicities + product of (1 - spamicity) for each of the 30 words)
If probability of current message > 0.5 then
    Spam = current message
Else
    Ham = current message

6- Result
To validate the proposed filter, it was tested on a corpus of 1700 actual e-mail messages, of which 900 were pre-classified as junk and 800 as legitimate. The results showed that the proposed filter worked more efficiently than other techniques such as public blacklists and whitelists. The proposed filter uses the 30 most "interesting" words to calculate the message's overall spamicity. These 30 words are the words in the message that have either the highest or lowest spamicity (i.e. are closest to 0 or 1 in value). The spamicity value assigned to each word ranges from 0.0 to 1.0. The spamicity is based on the number of times a word occurs in spam messages as opposed to the number of times it occurs in non-spam messages. Table 3 shows the generation of the four fields.

Table 3: the 4 fields' generation

Word      Spam and ham probability (digits run together in the source)   Spamicity
...ther   177171                                                          0.4037
cash      131849                                                          0.8737
contact   1552760                                                         0.3445
death     11837                                                           0.451
family    (values not preserved)
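The per-word spamicity and the combined message score from the pseudo code in section 5.7 can be sketched as follows. This is a minimal Python sketch, not the author's implementation. The function names, the tiny stop-word set, the reading of "word frequency" as spam plus ham occurrences, and the clamping of each spamicity to [0.01, 0.99] (added here only to avoid a zero denominator) are all assumptions made for the example.

```python
import math
import re

# Hypothetical stop-word set; the paper suggests a list of about 1500 entries.
STOP_WORDS = {"the", "which", "is", "at", "on", "in"}


def tokenize(text):
    """Split a message into lowercase single-word tokens with a regex."""
    return re.findall(r"[a-z']+", text.lower())


def word_spamicity(word, spam_table, ham_table, n_spam, n_ham):
    """Spamicity of one word from its occurrence counts in the two tables."""
    spam_freq = spam_table.get(word, 0)
    ham_freq = ham_table.get(word, 0)
    # Assumption: "word frequency" means spam plus ham occurrences.
    if spam_freq + ham_freq < 2:
        return 0.4
    spam_p = min(1.0, spam_freq / n_spam)  # capped at 1, as in the pseudo code
    ham_p = min(1.0, ham_freq / n_ham)
    return spam_p / (spam_p + ham_p)


def message_spam_probability(text, spam_table, ham_table, n_spam, n_ham, top_n=30):
    """Combine the top_n most 'interesting' word spamicities of a message."""
    words = {w for w in tokenize(text) if w not in STOP_WORDS}
    scores = []
    for w in words:
        s = word_spamicity(w, spam_table, ham_table, n_spam, n_ham)
        scores.append(min(0.99, max(0.01, s)))  # assumption: clamp away from 0 and 1
    # Most interesting first: farthest from the neutral value 0.5.
    scores.sort(key=lambda s: abs(s - 0.5), reverse=True)
    top = scores[:top_n]
    if not top:
        return 0.5  # no usable words: neutral score
    spam_product = math.prod(top)
    ham_product = math.prod(1.0 - s for s in top)
    return spam_product / (spam_product + ham_product)


# Made-up counts for illustration, with 10 spam and 10 ham training messages.
spam_table = {"drugs": 8, "delivery": 6, "meeting": 1}
ham_table = {"meeting": 7, "today": 9, "delivery": 2}
p_spam = message_spam_probability("Drugs delivery", spam_table, ham_table, 10, 10)
p_ham = message_spam_probability("Meeting today", spam_table, ham_table, 10, 10)
print(p_spam > 0.5, p_ham > 0.5)  # True False
```

A message is labelled spam when the combined score exceeds 0.5, matching the final test in the pseudo code. The counts in the usage lines are invented for the example and are not taken from the paper's tables.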
A spamicity value of 0.5 is neutral, meaning that it has no effect on the decision as to whether a message is spam or not.

7- Conclusion
In examining the growing problem of dealing with junk e-mail, we have found that it is possible to automatically learn an effective filter that eliminates a large portion of junk from a user's mail stream. It is also important that the email filter be trained on spam and non-spam messages from the user's own inbox. If an email filter is pre-trained on messages from another site, it will not be able to identify features specific to messages destined for the user. This can easily lead to large numbers of false positives and low spam detection accuracy.

The accuracy of such filters is greatly enhanced by considering not only the full text of the e-mail messages to be filtered, but also a set of hand-crafted features which are specific to the task at hand.

A plan for future work is to explore a method that deals with phrases in addition to single words.

8-Reference
1- https://en.wikipedia.org.
2- Androutsopoulos, I., J. Koutsias, K.V. Chandrinos, and C.D. Spyropoulos. 2000b. An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Encrypted Personal Messages. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece.
3- Cranor, L.F. and B.A. LaMacchia. 1998. Spam! Communications of ACM, 41(8):74–83.
4- Spector, Lincoln. "Guide to Spamming the Spammers". About.com.

5- Friedman, N., D. Geiger and M. Goldszmidt. 1997. Bayesian Network Classifiers. Machine Learning, 29(2/3):131–163.
6- Duda, R.O. and P.E. Hart. 1973. Bayes Decision Theory. Chapter 2 in Pattern Classification and Scene Analysis, pages 10–43. John Wiley.
7- Dean, Tamara (2010). Network+ Guide to Networks. Delmar. p. 519.
