An Effective Algorithm For Spam Filtering And Cluster Formation - IJCERT


Volume 3, Issue 12, December-2016, pp. 659-666
ISSN (O): 2349-7084
International Journal of Computer Engineering In Research Trends
Available online at: www.ijcert.org

An Effective Algorithm for Spam Filtering and Cluster Formation

Kavitha Guda
Associate Professor, Department of Computer Science and Engineering,
Vishwa Vishwani Institute of Technology, Hyderabad, Telangana, India.

Abstract: The K-means clustering algorithm is one of the most widely used partitioning algorithms for grouping elements of spatiotemporal data. It is fast, simple, and can work with large datasets. It also has pitfalls: it needs many iterations because the cluster details are not known at the initial stage, and it can detect only spherical clusters. We propose a Hybrid K-Means clustering algorithm that is based on splitting the dataset and reducing the number of iterations. It inherits features from two revised K-means algorithms. Splitting makes larger datasets easier to handle, and fewer iterations make cluster formation easier; together they increase the efficiency of the traditional K-means clustering algorithm. Furthermore, we also propose the Naïve Bayes algorithm for e-mail spam filtering on the SPAMBASE dataset.

Keywords: Data Mining, KDD, E-Mail, Spam, Naïve Bayes Algorithm, Spam Filter, K-Means Algorithm, Hybrid K-means Algorithm, SPAMBASE dataset.

Kavitha Guda, "An Effective algorithm for Spam Filtering and Cluster Formation", International Journal Of Computer Engineering In Research Trends, 3(12):659-666, December-2016.
2016, IJCERT All Rights Reserved. DOI: 10.22362/ijcert/2016/v3/i12/4321

1. INTRODUCTION

The volume of data has grown tremendously over the last few decades. The increase in the size and complexity of the data is driven by online commercial sites, work performed in the engineering field, and social media sites such as Facebook, Twitter, LinkedIn, and YouTube. The internet contains a huge amount of raw data, and several tools and techniques are used to process it and extract the relevant data effectively. Data mining is a process established for retrieving unseen information for the sake of gaining knowledge. Data can vary in dimension, complexity, and format. It can be represented as audio, video, or simply as text in alphabetic or numeric form. Data mining is needed to handle large volumes of data and to extract the required properties from a group of data.

[Figure 1. Steps in the KDD process]

Knowledge or facts can be acquired from data through a series of related steps. Data mining is also described as a knowledge discovery method, that is, an activity that extracts valuable information from a collection of raw data. Data mining is a central part of knowledge discovery [1].

• Collection of Raw Data: Data can be gathered from various sources, online and offline, such as social media, public sector banks, the retail sector, insurance companies, private sector banks, etc.
• Data Selection: Data volumes can be large, so the relevant and essential data required for further processing is selected.
• Data Pre-Processing: Raw data can contain false information in the form of missing values or noise, so the dataset must be pre-processed to remove any vague or incorrect data.
• Transformation: The data is transformed into a suitable shape so that the mining job can be carried out.
• Data Mining: Finding the relevant patterns in the data is called data mining. A variety of data mining approaches can be applied to the data.
• Evaluation: The information gained is evaluated for the exactness of the patterns and its compactness.
• Knowledge: The final required information is called knowledge.

Different Methods of Data Mining:
The methods relevant in data mining are listed below [2]. They are applied to raw data to gain and access knowledge.

• Anomaly Detection: Collected information that is irrelevant or bogus is detected and termed an anomaly. Anomaly detection tracks the information that contributes no fact or knowledge.
• Association Rule Mining (ARM): A procedure for establishing relationships between the items in the dataset.
• Clustering: A procedure that places similar data in one group, called a cluster, without any predefined model. It is a descriptive process of grouping the data.
• Classification: A procedure that uses a predefined, known structure to group the data into known, predefined groups. Classification modeling is a predictive model for grouping the data. It helps to assign data to different classes.
• Summarization: A process of describing the data in a compact form so that it can be visualized and represented.

Electronic mail spam: Electronic mails are classified into two broad categories: spam e-mails and ham e-mails. Spam e-mails are unauthenticated e-mails received from unknown sources that may contain a virus. Spam can originate from any external source such as the Web, text messages, broadcasts, etc. Depending on the kind of broadcast, spam can be categorized as e-mail spam, web spam, text spam, or social networking spam [3]. Spam e-mails are spreading at a great pace because sending data is fast and cheap. It has been observed that account holders receive more spam e-mails than ham e-mails. Spam filtering is important because spam wastes time, energy, and bandwidth and carries misleading information [4]. An e-mail can be labeled as spam only on the basis of these properties:

• Uninvited Emails: E-mails received from contacts that are not known to the user.
• Bulk Mailing: E-mails sent in bulk to multiple account holders at the same time.
• Unknown Mails: E-mails in which the identity and details of the sender are not revealed.

For instance, when a user receives a large amount of e-mail spam, the chance that the user overlooks a non-spam message increases. As a result, many e-mail readers have to spend their time removing unwanted messages. E-mail spam may also cost money for users with dial-up connections, waste bandwidth, and expose minors to unsuitable content. Over the past years, many approaches have been proposed to block e-mail spam [5]. Some spam is still not labeled as spam because the filter fails to detect it, so the accuracy of e-mail spam filtering remains an open problem. Several machine learning algorithms have been used in spam e-mail filtering, but the Naïve Bayes algorithm is particularly popular in commercial and open-source spam filters [6]. This is because of its simplicity, which makes it easy to implement and means it needs only a short training time and a fast evaluation to filter e-mail spam. The filter requires training, which can be provided by a previous set of spam and non-spam messages. It keeps track of each word that occurs only in spam, only in non-spam messages, and in both. Naïve Bayes can be used on different datasets, each with different features and characteristics.

2. RELATED WORK

Shi Na et al. first explain the characteristics of the k-means algorithm and then propose an enhanced k-means algorithm that reduces the number of iterations. The improved algorithm avoids recalculating the distance of each object to every cluster center again and again. First, it randomly chooses K data points and calculates the first cluster centers by smallest Euclidean distance. Two arrays are used: the first stores the smallest distance of each object to a cluster, and the second stores the cluster center of the object. This information is useful in reducing the number of times the loops are executed. In this way, it improves the efficiency of the k-means algorithm by increasing the execution speed. Two different types of datasets are used, and both the k-means and enhanced k-means algorithms are run on them. The experiments show that the enhanced k-means algorithm outperforms the traditional k-means algorithm [7].

Sourabh Shah et al. compare three algorithms: k-medoid, k-means, and modified k-means. In the PAM algorithm, K objects are initially chosen as medoids. The distance of each object to each medoid is calculated, and the object is assigned to the medoid with the smallest distance. In this way, every data item is allocated to the nearest medoid. In the next step, swapping is done: a medoid m is swapped with a non-medoid o, the same procedure is followed, and the new cost is calculated. If this cost is less than the previous cost, the newly chosen object becomes the medoid. After this iteration, another non-medoid is swapped with the medoid and the same procedure is repeated. The whole process continues until the medoids no longer change. The modified k-means algorithm is also described. It is very similar to the k-means algorithm. The only difference is that, instead of applying k-means to the whole dataset, the dataset is split into smaller parts and k-means is applied to these subparts. It is found experimentally that the modified k-means algorithm shows enhanced performance compared to the traditional k-means algorithm and k-medoid on the same dataset [8].

Kwai Han et al. explain that data-intensive peer-to-peer (P2P) networks are finding an increasing number of applications. Data mining in such P2P environments is a natural extension. However, conventional centralized data mining does not fit well in these environments, because it usually requires centralizing the distributed data, which is often not practical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternative. The paper considers the distributed k-means clustering problem in which the data and computing resources are spread over a large P2P network. It offers two algorithms that produce an approximation of the results of the standard centralized k-means clustering algorithm. The first is designed to work in a dynamic P2P network and produces a clustering by local synchronization only. The second algorithm uses uniformly sampled peers and gives analytical guarantees concerning the accuracy of clustering on a P2P network. Experimental results show that both algorithms perform well compared with their centralized counterparts at a modest communication cost [10].

Konstantin Tretyakov et al. [11] evaluate several of the most popular machine learning methods, i.e., Bayesian classification, k-NN, ANNs, and SVMs, and their applicability to the problem of spam filtering. In this work, the author proposes simple sample implementations of the named techniques and compares their performance on the PU1 spam corpus. The author uses feature extraction to convert all messages to vectors of numbers (feature vectors) and then classifies these vectors. This is because most machine learning algorithms can only classify numerical objects.

3. METHODOLOGY

3.1 Methodology for Hybrid K-Means Algorithm

K-means is the traditional partitioning algorithm. Researchers have used it in many fields, such as biology, insurance, banking, and marketing. It has been modified many times because it has several drawbacks: the number of clusters must be given in advance, the initial points must be chosen, and it needs a large number of iterations. Many researchers have proposed solutions to these problems. Some have used a hybrid approach. Some have reduced the calculations to increase the speed. Some have used a different method to choose the initial clusters. Others have used their own methods to choose the number of cluster centers, whereas some have used the median, mode, or max-min distance to find the minimum distance. K-means has further problems: it is hard to estimate the value of K, and different values of K give different clusters. It works only with numerical data. It cannot detect noise and outliers, because it puts all the data into clusters. It cannot deal with irregular shapes. It cannot work with very large datasets. It does not work well with clusters of different density.

From our analysis of the k-means algorithm, we found that its speed and efficiency can be improved with our approach, and that the algorithm can be enhanced to deal with very large datasets and made comparatively more robust. Our study merges two approaches: splitting the dataset into smaller datasets, and reducing the number of iterations. We first take a dataset and divide it into smaller datasets. We then run a modified form of the k-means clustering algorithm in which the number of iterations is reduced, which increases its efficiency. By dividing the dataset into smaller datasets, we make the traditional k-means more robust, in the sense that it can now deal with comparatively larger datasets than the traditional k-means algorithm.

The methodology we adopted was first to collect the material on data mining, which is known as a literature survey. The second step was to choose the main topic in data mining on which to proceed with our research, and clustering was chosen. Research papers were then studied to find the problem definition. Here we deal with the k-means algorithm, which is a partitioning clustering algorithm. The material relevant to the k-means algorithm was collected, and to deal with that problem we present an enhanced k-means clustering algorithm. It is mainly an algorithm for selecting the initial values that the K-means clustering algorithm then follows. If the wrong clusters are chosen initially, the result is poor clustering. The k-means algorithm is initialized with a random set of cluster centres. We introduce a different way of selecting the centres and a method to reduce the number of iterations. Basically, we split the data into smaller sets and then run an algorithm on these smaller datasets to reduce the number of iterations.

Steps for the proposed algorithm:
• First of all, draw multiple sub-samples from the original dataset.
• From every sub-sample, arbitrarily choose k items as initial cluster centres.
• Compute the Euclidean distance between every data item and all cluster centres, and allocate data items to the nearest clusters.
• For every data item, locate the nearest centre and assign the item to that cluster centre.
• Store the label of the cluster centre to which each data item belongs and the distance of the data item to its nearest cluster.
• Recalculate the cluster centre for each cluster.
• For every data item, compute its distance to the centre of its current nearest cluster. If this distance is less than the previous distance, the data item stays in its cluster; otherwise, calculate the distance of the data item to all cluster centres and allocate it to the nearest centre.
• For every cluster, recalculate the centres until the convergence criterion is met.
• Output the clustering results.

3.2 Methodology for Email Spam Filtering

The filtering method is based on machine learning techniques and is divided into three phases:
(i) Stage 1: Pre-processing
(ii) Stage 2: Feature Selection
(iii) Stage 3: Naive Bayes Classifier
The following sections explain the activities involved in each phase. Figure 2 shows the process for e-mail spam filtering based on the Naïve Bayes algorithm.

[Figure 2. Process of e-mail spam filtering based on the Naive Bayes algorithm]

Stage 1: Pre-processing
Today, most real-world data are incomplete, containing aggregate, noisy, and missing values [12]. When e-mails are pre-processed, before the filter is trained, words such as conjunctions and articles are removed from the e-mail body because they are not useful in classification. We use the WEKA tool to facilitate the experiments. For both experiments, the datasets are presented as Attribute-Relation File Format (ARFF) files.

Stage 2: Feature Selection
After the pre-processing step, we apply a feature selection algorithm. The algorithm deployed here is the Best First feature selection algorithm [13].

Dataset 1: SPAMBASE
Source: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt, Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304

Stage 3: Naive Bayes Classifier
This stage performs e-mail spam filtering based on the Naive Bayes algorithm. The Naive Bayes algorithm is a simple probabilistic classifier that calculates a set of probabilities by counting the frequency and combination of values in a given dataset [4]. In this research, the Naive Bayes classifier uses bag-of-words features to identify spam e-mail, and a text is represented as the bag of its words. The bag of words is commonly used in document classification, where the frequency of occurrence of each word is used as a feature for training the classifier. These bag-of-words features are included in the chosen datasets.

The Naive Bayes technique uses Bayes' theorem to determine the probability that an e-mail is spam. Some words have particular probabilities of occurring in spam or non-spam e-mail. For example, suppose we know for certain that the word Free never occurs in a non-spam e-mail. Then, when we see a message containing this word, we can tell for sure that it is spam. Bayesian spam filters learn a very high spam probability for words such as Free and Viagra, but a very low spam probability for words seen in non-spam e-mail, such as the names of friends and family members. So, to calculate the probability that an e-mail is spam or not, the Naive Bayes technique uses Bayes' theorem, as shown in the formula below.

$$P(\text{spam} \mid \text{word}) = \frac{P(\text{word} \mid \text{spam})\, P(\text{spam})}{P(\text{word} \mid \text{spam})\, P(\text{spam}) + P(\text{word} \mid \text{non-spam})\, P(\text{non-spam})}$$

Where:
(i) P(spam | word) is the probability that an e-mail is spam given that it contains the particular word.
(ii) P(spam) is the probability that any given message is spam.
(iii) P(word | spam) is the probability that the particular word appears in a spam message.
(iv) P(non-spam) is the probability that any given message is not spam.
(v) P(word | non-spam) is the probability that the particular word appears in a non-spam message.

To achieve the objective, the research procedure is conducted in the three phases described above.

The Evaluation Metric
Evaluation metrics are used to evaluate the performance of the WEKA tool on the chosen SPAMBASE dataset. The simplest measure is filtering accuracy, namely the percentage of messages classified correctly. Table 1 shows the evaluation measures for spam filters.

[Table 1. Evaluation measures for spam filters]

4. RESULTS AND DISCUSSION

For these results we used an inbuilt dataset of WEKA. We ran the K-Means clustering algorithm that is already defined in WEKA and then ran the updated K-Means on the same dataset, i.e., the e-mail dataset. By running both algorithms on the same dataset, we found that our algorithm runs with more efficiency and robustness. By efficiency, we mean that its processing speed is faster than that of the traditional K-Means algorithm. By robustness, we mean that our proposed algorithm can work efficiently with large datasets compared to the traditional K-Means.

[Fig 3. Time comparison. Series: K-Means, K-Means Modified, Hybrid K-Means. Bar labels: 20.23, 18.32, 19.36, 9.11]

Email Spam with Naïve Bayes algorithm
Accuracy refers to the proportion of e-mails classified as the correct type among all e-mails. Correct outcomes are True Positive (TP) and True Negative (TN), while incorrect outcomes are False Positive (FP) and False Negative (FN). The accuracy of the system is calculated by the following equation:

Accuracy = ((TP + TN) / (TP + TN + FP + FN)) × 100

[Fig 4. Accuracy in spam classification]
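The steps for the proposed algorithm in Section 3.1 can be sketched in Python as follows. This is a minimal illustration and not the implementation used in the experiments, which were run in WEKA. The function names (`nearest`, `cached_kmeans`, `hybrid_kmeans`) and the parameters `n_samples`, `sample_size`, and `seed` are hypothetical. The paper does not say how the results from the sub-samples are combined, so the sketch assumes that the centres with the lowest sum of squared errors on the full dataset seed one final pass. Note that the distance shortcut is a heuristic: an item that has not moved away from its centre is not compared against the other centres.

```python
import math
import random


def nearest(point, centres):
    """Return (index, distance) of the centre closest to point."""
    dists = [math.dist(point, c) for c in centres]
    idx = min(range(len(centres)), key=dists.__getitem__)
    return idx, dists[idx]


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def cached_kmeans(data, centres, max_iter=100):
    """K-means that stores each item's cluster label and distance."""
    centres = list(centres)
    labels, dists = map(list, zip(*(nearest(p, centres) for p in data)))
    for _ in range(max_iter):
        # Recalculate each centre from its current members.
        for j in range(len(centres)):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                centres[j] = mean(members)
        changed = False
        for i, p in enumerate(data):
            d = math.dist(p, centres[labels[i]])
            if d <= dists[i]:
                dists[i] = d  # not farther than before: keep the cluster
                continue
            j, dj = nearest(p, centres)  # otherwise search all centres
            changed = changed or j != labels[i]
            labels[i], dists[i] = j, dj
        if not changed:  # convergence: no item switched cluster
            break
    return centres, labels


def sse(data, centres):
    """Sum of squared distances of each item to its nearest centre."""
    return sum(nearest(p, centres)[1] ** 2 for p in data)


def hybrid_kmeans(data, k, n_samples=4, sample_size=50, seed=0):
    """Cluster sub-samples, then refine the best centres on all data.

    Assumption: picking the lowest-SSE centres is our choice for
    combining sub-sample results; the paper does not specify one.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(n_samples):
        sub = rng.sample(data, min(sample_size, len(data)))
        centres, _ = cached_kmeans(sub, rng.sample(sub, k))
        if best is None or sse(data, centres) < sse(data, best):
            best = centres
    return cached_kmeans(data, best)
```

Calling `hybrid_kmeans(points, k=2)` on a list of equal-length numeric tuples returns the final centres and one cluster label per point. The sketch requires `k` to be no larger than the sub-sample size.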

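The Bayes' theorem calculation from Section 3.2 and the accuracy equation can be illustrated with a short Python sketch. It is a simplified stand-in for the WEKA classifier, not a reproduction of it. The function names `train`, `spam_probability`, and `accuracy` are hypothetical, only the words present in a message are scored, and add-one smoothing is an assumption added here so that unseen words do not produce zero probabilities. The training set is assumed to contain at least one spam and one non-spam message.

```python
import math
from collections import Counter


def train(messages):
    """Count, per class, how many messages contain each word.

    messages is a list of (words, is_spam) pairs.
    """
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for words, is_spam in messages:
        class_counts[is_spam] += 1
        word_counts[is_spam].update(set(words))
    return word_counts, class_counts


def spam_probability(words, model):
    """P(spam | words) by Bayes' theorem with the naive independence assumption."""
    word_counts, class_counts = model
    total = sum(class_counts.values())
    log_score = {}
    for label in (True, False):
        # log P(class) + sum of log P(word | class), add-one smoothed
        score = math.log(class_counts[label] / total)
        for w in set(words):
            p = (word_counts[label][w] + 1) / (class_counts[label] + 2)
            score += math.log(p)
        log_score[label] = score
    # Normalise the two joint scores into a posterior probability.
    top = max(log_score.values())
    spam = math.exp(log_score[True] - top)
    ham = math.exp(log_score[False] - top)
    return spam / (spam + ham)


def accuracy(predicted, actual):
    """Accuracy = ((TP + TN) / (TP + TN + FP + FN)) x 100."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    return (tp + tn) / (tp + tn + fp + fn) * 100
```

A message would be flagged as spam when `spam_probability` exceeds a chosen threshold such as 0.5. The threshold is a design choice that trades false positives against false negatives.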
Once training was complete, we observed that the classifier reached a training accuracy of about 96% and a test accuracy of about 97.23%.

5. CONCLUSION

The proposed algorithm, i.e., K-Means Updated, emphasizes the optimum utilization of resources while calculating K-Means. Processing time is reduced, and comparatively large datasets can be processed. From our analysis of the K-Means algorithm, we found that its speed and efficiency can be improved with our approach, and that the algorithm can be enhanced to deal with very large datasets. E-mail spam filtering is an important issue in network security and machine learning, and the Naïve Bayes classifier plays a very important role in filtering e-mail spam. The quality of the Naïve Bayes classifier's performance also depends on the datasets used. The Naïve Bayes classifier can achieve its highest precision, and block the highest percentage of spam messages, when the dataset is collected from a single e-mail account.

REFERENCES

[1] Marek Rychly, Pavlina Ticha, "A tool for clustering in data mining", International Federation for Information Processing, 2007.
[2] P. Verma, D. Kumar, "Association Rule Mining Algorithm's Variant Analysis", International Journal of Computer Application (IJCA), vol. 78, no. …, September 2013, pp. 26-34.
[3] L. Firte, C. Lemnaru, R. Potolea, "Spam Detection Filter using KNN Algorithm and Resampling", 6th International Conference on Intelligent Computer Communication and Processing, IEEE, 2010, pp. 27-33.
[4] G. Kaur, R. K. Gurm, "A Survey on Classification Techniques in Internet Environment", International … in Computer and Communication Engineering, vol. 5, no. 3, March 2016, pp. …
[5] Rushdi, S. and Robet, M., "Classification spam emails using text and readability features", IEEE 13th International Conference on Data Mining, 2013.
[6] … 2011.
[7] … clustering …, International Symposium on Intelligent Information Technology and Security Informatics, 2011.
[8] Shah Sourabh, Singh Manmohan, "Comparison of a time efficient modified k-mean algorithm with k-mean and k-medoid algorithm", International Conference on Communication Systems and Network Technologies, 2012.
[9] Boomjia M.D, "Comparison of partitioning based clustering algorithms".
[10] Han Kwai, "Approximate distributed k-means clustering over a peer-to-peer network", IEEE Transactions on Knowledge and Data Engineering, 2009.
[11] Tariq, M., B., Jameel A. Tariq, Q., Jan, R. Nisar, A.S., "Detecting Threat E-mails using Bayesian Approach", IJSDIA International Journal of Secure Digital Information Age, Vol. 1, No. 2, December 2009.
[12] ML & KD - Machine Learning & Knowledge Discovery …
[13] Rizky, W. M., Ristu, S., Afrizal, D., "The Effect of Best First and Spreadsubsample on Selection of a Feature Wrapper With Naïve Bayes Classifier for The Classification of the Ratio of Inpatients", Scientific Journal of Informatics, Vol. 3(2), pp. 41-50, Nov. 2016.
[14] Feng, W., Sun, J., Zhang, L., Cao, C. and Yang, Q., "A support vector machine based naive Bayes algorithm for spam filtering", 2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC), Las Vegas, NV, 2016, pp. 1-8.
[15] Lalchand G. Titare, Prof. Riya Qureshi, "Cloud Centric IoT Based Farmer's Virtual Market place", International Journal of Computer Engineering In Research Trends, vol. 3, no. 12, pp. 654-658, 2016.
