Random Histogram Forest For Unsupervised Anomaly Detection


Andrian PUTINA, Mauro SOZIO (Telecom Paris, Paris, France, name.surname@telecom-paris.fr)
Dario ROSSI, José M. NAVARRO (Huawei Technologies SASU, Paris, France, name.surname@huawei.com)

Abstract: Roughly speaking, anomaly detection consists of identifying instances whose features deviate significantly from the rest of the input data. It is one of the most widely studied problems in unsupervised machine learning, boasting applications in network intrusion detection, healthcare and many others. Several methods have been developed in recent years; however, to the best of our knowledge a satisfactory solution is still missing. We present Random Histogram Forest (RHF), an effective approach for unsupervised anomaly detection. Our approach is probabilistic, which has proved to be effective in identifying anomalies, and it employs the fourth central moment (aka kurtosis) to identify potentially anomalous instances. We conduct an extensive experimental evaluation on 38 datasets, including all benchmarks for anomaly detection as well as the most successful algorithms for unsupervised anomaly detection, to the best of our knowledge. We evaluate all approaches in terms of the average precision of the area under the precision-recall curve (AP). Our evaluation shows that our approach significantly outperforms all other approaches in terms of AP while boasting linear running time.

I. INTRODUCTION

Hawkins [1] defines an anomaly (aka outlier) as "an observation, which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism." Identifying such anomalies is crucial in many applications, as it might reveal, for example, an unauthorized access in a computer network or a disease in a patient. Anomaly detection thus boasts applications in network intrusion detection, fraud detection, healthcare and many other areas, while also providing a valuable tool for data cleaning. It is one of the most widely studied problems in both unsupervised and supervised machine learning. Because of the difficulty of obtaining large amounts of labeled data, as well as of dealing with high class imbalance [2], unsupervised approaches preserve their appeal. As a result, unsupervised anomaly detection has received increasing attention in recent years (e.g. Isolation Forest [3], OCSVM [4], LOF [5]). Among the supervised approaches we mention random forest [6] and autoencoders [7].

One of the most effective algorithms for unsupervised anomaly detection is Isolation Forest (iForest), as confirmed also by our experimental evaluation. In this work, we present Random Histogram Forest (RHF), an effective approach for unsupervised anomaly detection. Similarly to iForest, our approach is probabilistic and relies on an "ensemble" of "weak" building blocks (trees) to effectively identify anomalies, a strategy that has proved effective for a wide range of tasks (e.g. Random Forest [6], Isolation Forest [3]). Our approach builds a random forest based on all input instances, whereas iForest builds a random forest based solely on random samples of the data. The latter strategy has the drawback that some anomalies might be neglected entirely in the construction of the forest, thereby impairing the capability of the algorithm to find those anomalies. Nevertheless, our algorithm boasts linear running time in the size of the input. Another key idea of our algorithm is to employ the fourth central moment (aka kurtosis) to guide the search for anomalous instances.
Our algorithm computes a score for every instance, measuring the likelihood of that instance being an anomaly. Such a score is defined as the Shannon information content of the leaf containing the corresponding instance.

We conduct an extensive experimental evaluation on 38 datasets, including all benchmarks for anomaly detection as well as the most successful algorithms for unsupervised anomaly detection, to the best of our knowledge. We evaluate all approaches in terms of the average precision of the area under the precision-recall curve (AP). We observe that, as suggested in [8], ROC might not reflect the real performance of an algorithm, since anomalies typically represent a small fraction of the input data.

Our experimental evaluation shows that RHF outperforms all other approaches in terms of AP. In particular, it outperforms iForest in terms of AP by 10% on average and by up to a factor of 2 on some datasets. Moreover, the performance of our algorithm is consistently good over a wide range of parameter values, which allows for effective parameter tuning.

The rest of the paper is organized as follows. We start by reviewing top-performing anomaly detection algorithms in Section II and introduce the Random Histogram Forest algorithm in Section III. We then describe our experimental evaluation and results in Section IV. Finally, we discuss and summarize our main findings in Section V.

II. RELATED WORK

Several unsupervised anomaly detection algorithms have been developed in recent years.

We can classify the related work as follows: (i) probabilistic/linear methods (e.g. PPCA, HBOS, OCSVM); (ii) proximity/nearest-neighbor methods (e.g. K-NN, Kth-NN, Local Outlier Factor); (iii) ensemble/isolation methods (e.g. Isolation Forest).

Among the most recent comparative studies of unsupervised techniques, [9] compares most of the existing proximity-based methods on 10 different datasets and concludes that the initial assumption on whether the anomalies in a dataset are global or local is of great importance: the authors recommend using a global anomaly detection method if there is no further knowledge about the nature of the anomalies in the dataset to be analyzed. [10] compares 14 methods belonging to all the groups previously described on 15 different datasets (12 publicly available and 3 private ones). However, their study assesses whether the models are able to generalize to future instances, so they perform a Monte Carlo cross-validation with 5 iterations, using 80% of the data for training and 20% for prediction, which to our understanding corresponds to a semi-supervised setup. While [9] does not include the latest methods (e.g. Isolation Forest, thought to be the state of the art), [10] describes the models' generalization capacity using labels that are most of the time not available. We compare the methods which have proven to be the best in previous studies [9], [10] using default or reasonable parameters, and use labels only to assess their performance in a completely unsupervised environment.

Probabilistic methods estimate the parameters θ of the dataset X and assign to each instance x ∈ X an anomaly score equal to the likelihood P(X | θ).

Probabilistic Principal Component Analysis (PPCA) [11]: estimates the principal components of the data and projects the d-dimensional dataset onto a q-dimensional one, estimating the latent variables by iteratively maximizing the likelihood with an expectation-maximization algorithm.

Histogram-based Outlier Score (HBOS) [12]: generates a histogram for each feature, assuming the features are independent. Similarly to the Naive Bayes approach, in which the independent feature probabilities are multiplied, HBOS outputs an anomaly score given by the product of the inverse bin heights over all features.

One-Class SVM (OCSVM) [4]: determines a separating hyperplane in a higher-dimensional space by maximizing the distance from the hyperplane to the origin. The ν parameter acts as a regularization parameter, as it determines an upper bound on the percentage of data falling outside the boundary and a lower bound on the number of support vectors.

Proximity/nearest-neighbor methods compute the neighborhood of every instance x ∈ X and use the distances of each instance x to describe its abnormality.

K-NN [13] and Kth-NN [14]: both compute, for each instance x ∈ X, the distances to its k nearest neighbors and assign either the mean distance (K-NN) or the maximum one (Kth-NN) as score.

Local Outlier Factor (LOF) [5]: first introduced the idea of local anomalies, and detects them by comparing the local density of each instance against the local density of its neighbors.

Ensemble/isolation methods isolate anomalies, instead of profiling normal instances, by recursively splitting the data through random trees, thereby generating isolation forests.

Isolation Forest (iForest) [3]: builds a forest of randomly generated trees and assigns to each instance x ∈ X, as anomaly score, the average path length from the root to the node containing x. The authors show empirically that shorter path lengths are representative of anomalies, as they are easier to isolate than normal data.

Based on the most recent comparison studies [9], [10], [15], the algorithms described above have proven to be among the best ones for anomaly detection. In general, both [10] and [15] suggest iForest to be, on average, the best one, closely followed [10] by PPCA and OCSVM.
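For concreteness, two of these baselines can be scored with off-the-shelf scikit-learn implementations (the same library our evaluation in Section IV relies on for iForest and LOF). This is a minimal illustrative sketch; the parameter values here are library defaults, not necessarily the exact settings of the cited papers or of Section IV.

```python
# Minimal sketch: anomaly scores from two baselines via scikit-learn.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(500, 3)),   # normal instances
               rng.normal(6, 1, size=(10, 3))])   # a few anomalies

# iForest: shorter average path length means more anomalous.
iforest = IsolationForest(n_estimators=100, max_samples=256, random_state=0).fit(X)
iforest_scores = -iforest.score_samples(X)         # negate so that higher = more anomalous

# LOF: compares the local density of x with that of its neighbors.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)
lof_scores = -lof.negative_outlier_factor_         # negate so that higher = more anomalous

print(iforest_scores[-10:].mean(), iforest_scores[:500].mean())
```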
III. RANDOM HISTOGRAM FOREST (RHF)

RHF is an ensemble model that splits the dataset into η different groups. The algorithm randomly partitions the input points by means of several splits drawn at random. Intuitively, points that end up in a relatively large group are less likely to be anomalies. By iterating such a process multiple times, we can measure how likely a point is to be an anomaly.

To put this idea into practice, RHF splits the data into η different bins by means of Random Histogram Trees (RHT). Each RHT is built by recursively splitting the whole dataset: each splitting decision selects an attribute a to split on according to its kurtosis score, and a split value drawn at random from the attribute's support. By setting the maximum height parameter h, each RHT produces at most η = 2^h leaves, corresponding to the η bins into which all instances are grouped. Finally, the anomaly score of each instance x ∈ X is the information content (aka self-entropy) of its terminating leaf Q, aggregated over all trees.

Random Histogram Tree: given a dataset X ∈ R^{n×d}, where n is the number of instances and d the number of features (attributes), an RHT is a binary tree in which a node Q_i is either an internal node with exactly two children or a leaf node with no children. In the former case, the node splits the dataset into a left branch X_{Q_L} and a right one X_{Q_R}, according to the kurtosis split. An RHT contains at most η = 2^h leaves, where h is the maximum height of the tree.

Kurtosis split: a kurtosis split chooses which of the d attributes to split the dataset X on, by assigning higher probability to features whose kurtosis scores are higher. Given a random variable X, the kurtosis score is defined as

K[X] = E\left[\left(\frac{X-\mu}{\sigma}\right)^{4}\right] = \frac{E\left[(X-\mu)^{4}\right]}{\left(E\left[(X-\mu)^{2}\right]\right)^{2}} = \frac{\mu_{4}}{\sigma^{4}} \qquad (1)

where µ4 is the fourth central moment and σ the standard deviation; it measures the tailedness of X.
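As a quick illustration of Eq. (1) (a minimal NumPy sketch; the helper name kurtosis_per_feature is ours, not the paper's), the per-feature kurtosis can be computed directly. A feature contaminated by a handful of far-away points receives a much larger score than a purely Gaussian one, whose kurtosis is close to 3.

```python
# Minimal sketch of Eq. (1): per-feature (non-excess) kurtosis computed with NumPy.
import numpy as np

def kurtosis_per_feature(X):
    """K[X_a] = E[(X_a - mu)^4] / (E[(X_a - mu)^2])^2 for every column a of X."""
    centered = X - X.mean(axis=0)
    mu4 = np.mean(centered ** 4, axis=0)        # fourth central moment
    var = np.mean(centered ** 2, axis=0)        # sigma^2
    return mu4 / (var ** 2 + 1e-12)             # eps avoids 0/0; constant features get K = 0

rng = np.random.RandomState(0)
X = np.column_stack([
    rng.normal(size=1000),                                # Gaussian feature: K close to 3
    np.concatenate([rng.normal(size=990),
                    rng.normal(10.0, 1.0, size=10)]),     # heavy tail: 10 far-away points
])
print(kurtosis_per_feature(X))                            # second value is much larger
```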

Equation (1) denotes the expectation of the standardized data raised to the fourth power. As a result, instances within the region of the peak have a negligible contribution to the kurtosis score, while instances outside the region of the peak (e.g. outliers) contribute the most. In [16], Moors defined it as a measure of dispersion, and concluded that high values of K are due to either i) occasional values far from the mean in a distribution whose probability mass is concentrated around the mean, or ii) probability mass concentrated in the tails of the distribution. The kurtosis score measures the heaviness of the tails and is therefore an indicator of the existence of outliers in the tails. Consider, for example, Fig. 1, representing 4 features extracted from the datasets Annthyroid (top) and Mulcross (bottom), depicting both the normal and the anomalous probability density functions. It is easily observable that features with heavier tails, and consequently higher kurtosis scores (e.g. X1, top, and X2/X3, bottom), are more likely to contain anomalous instances than the remaining ones (e.g. X0/X4, top, and X0/X1, bottom, in which the anomalies are clearly not separable: the probability density functions overlap).

[Fig. 1. Probability density functions of 4 features, depicting both the normal and the anomalous class, extracted from the datasets Annthyroid and Mulcross respectively, together with their kurtosis scores K(X_a). Features with heavier tails and consequently higher kurtosis scores (e.g. X1, top, and X2/X3, bottom) are more likely to contain separable anomalous instances than the remaining ones (e.g. X0/X4, top, and X0/X1, bottom, in which the anomalies are clearly not separable).]

We compute the logarithm y = log(x + 1) of the kurtosis score so as to focus on its order of magnitude, while preventing our approach from being sensitive to small changes in such a score. Notice that we add 1 to all the scores so that constant features (kurtosis 0) obtain a transformed score equal to 0 and are thus not selectable as splitting attributes.

A kurtosis split therefore:

i) computes the kurtosis scores (1) of all the attributes a ∈ {1, ..., d} and defines the kurtosis sum K_s:

K_s = \sum_{a=1}^{d} \log\left[ K(X_a) + 1 \right] \qquad (2)

ii) picks a random value r uniformly from [0, K_s]:

r \sim U[0, K_s] \qquad (3)

iii) designates the attribute a_s to split on according to the choice of r within K_s:

a_s = \arg\min_{i} \left\{ \sum_{a=1}^{i} \log\left[ K(X_a) + 1 \right] \geq r \right\} \qquad (4)
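The three steps above can be put together in a few lines; this is an illustrative sketch (the function name kurtosis_split is ours) that reuses the kurtosis_per_feature helper from the previous snippet.

```python
# Illustrative sketch of the kurtosis split, Eqs. (2)-(4),
# reusing kurtosis_per_feature defined in the previous snippet.
import numpy as np

def kurtosis_split(X, rng):
    """Return (attribute index a_s, split value a_v) for a dataset X of shape (n, d)."""
    log_k = np.log(kurtosis_per_feature(X) + 1.0)        # log[K(X_a) + 1] for every attribute
    ks = log_k.sum()                                     # kurtosis sum K_s, Eq. (2)
    r = rng.uniform(0.0, ks)                             # r ~ U[0, K_s], Eq. (3)
    a_s = int(np.searchsorted(np.cumsum(log_k), r))      # first index with cumulative sum >= r, Eq. (4)
    a_v = rng.uniform(X[:, a_s].min(), X[:, a_s].max())  # split value drawn from the attribute's support
    return a_s, a_v
```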
Internal node: given an internal node Q, the latter splits the data into exactly two subsets X_{Q_L} and X_{Q_R} (if possible), by selecting an attribute a_s according to the kurtosis split criterion and a splitting value a_v ∈ (min a_s, max a_s), until:
i) the maximum height h of the tree is reached;
ii) there is only |X_Q| = 1 instance in Q;
iii) there is only one distinct instance in Q (all the remaining instances are duplicates).

Leaf: a leaf is a node with no children which is populated by at least one instance. Given a leaf Q, we denote with S(Q) the set of distinct instances associated to Q. We also define P_Q := |S(Q)| / n, which can be seen as the probability that an instance is associated to the leaf Q. We recall that the total number of leaves is bounded by 2^h, where h is the maximum height of the tree (an input parameter).

Anomaly score: given an instance p ∈ X and a Random Histogram Tree RHT_i ∈ RHF, we define the anomaly score of p w.r.t. RHT_i as

RHT_i(p) = \log\left( \frac{1}{P_{Q_j}} \right), \quad p \in S(Q_j), \qquad (5)

which can be seen as the Information Content (also called Shannon information [17]) of p in RHT_i. The Information Content measures the level of "surprise" of an event, with rare events being more surprising than relatively common events. As a result, the smaller the cardinality of S(Q_j), the more likely p is considered to be an outlier. We adopt the standard convention for log(0). The anomaly score of p w.r.t. the RHF is obtained by summing the scores over all RHT_i's as follows:

RHF(p) = \sum_{i=1}^{t} RHT_i(p) \qquad (6)

The pseudocode of the algorithm constructing a Random Histogram Forest is provided in Algorithm 1, while Algorithm 2 shows how to compute the anomaly score for a point p.

Algorithm 1: RHT(X, nd, h)
Input: dataset X, node depth nd and max height h
Output: RHT
 1: if nd ≥ h or |X| ≤ 1 then
 2:     return Leaf{S(Q)}
 3: else
 4:     let A be the set of attributes
 5:     select the attribute to split a_s according to the kurtosis split described in (2), (3), (4)
 6:     select a random split value a_v between the min and max support of a_s in X
 7:     X_l ← filter(X: a_s < a_v)
 8:     X_r ← filter(X: a_s ≥ a_v)
 9:     return Node{value = a_v, attribute = a_s, left = RHT(X_l, nd + 1, h), right = RHT(X_r, nd + 1, h)}
10: end if

Algorithm 2: Score(x, node)
Input: instance x ∈ X and an RHT node
Output: anomaly score of x given the RHT
 1: if node is a Leaf then
 2:     P ← |node{S(Q)}| / |X|
 3:     return log(1 / P)
 4: else
 5:     a ← node.attribute
 6:     v ← node.value
 7:     if x_a < v then
 8:         return Score(x, node.left)
 9:     else
10:         return Score(x, node.right)
11:     end if
12: end if

The running time of Algorithm 1 is O(n · t · d · h), where h denotes the maximum height of the trees, t the number of trees, n the number of instances and d the number of dimensions.
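For concreteness, the following compact Python sketch re-implements Algorithms 1 and 2 under our reading of the pseudocode (function and class names such as build_rht and rhf_scores are ours; the authors' reference implementation is the one available at [23]).

```python
# Compact, self-contained sketch of Algorithms 1 and 2 (illustrative, not the paper's code).
import numpy as np

class Node:
    def __init__(self, attribute=None, value=None, left=None, right=None, n_distinct=None):
        self.attribute, self.value = attribute, value
        self.left, self.right = left, right
        self.n_distinct = n_distinct                    # set only for leaves: |S(Q)|

def log_kurtosis(X):
    """Per-attribute weights log[K(X_a) + 1], with K as in Eq. (1)."""
    c = X - X.mean(axis=0)
    mu4, var = np.mean(c ** 4, axis=0), np.mean(c ** 2, axis=0)
    return np.log(mu4 / (var ** 2 + 1e-12) + 1.0)       # constant features get weight 0

def build_rht(X, depth, h, rng):
    """Algorithm 1: recursively split X until the max height or a single distinct instance."""
    n_distinct = len(np.unique(X, axis=0))
    if depth >= h or n_distinct <= 1:
        return Node(n_distinct=n_distinct)
    w = log_kurtosis(X)                                  # kurtosis split, Eqs. (2)-(4)
    a_s = int(np.searchsorted(np.cumsum(w), rng.uniform(0.0, w.sum())))
    a_v = rng.uniform(X[:, a_s].min(), X[:, a_s].max())
    mask = X[:, a_s] < a_v
    if mask.all() or (~mask).all():                      # degenerate split: stop early
        return Node(n_distinct=n_distinct)
    return Node(attribute=a_s, value=a_v,
                left=build_rht(X[mask], depth + 1, h, rng),
                right=build_rht(X[~mask], depth + 1, h, rng))

def score_instance(x, node, n):
    """Algorithm 2 / Eq. (5): information content log(n / |S(Q)|) of the leaf reached by x."""
    if node.n_distinct is not None:
        return np.log(n / node.n_distinct)
    child = node.left if x[node.attribute] < node.value else node.right
    return score_instance(x, child, n)

def rhf_scores(X, t=100, h=5, seed=0):
    """Eq. (6): sum the per-tree scores over t trees; higher means more anomalous."""
    rng = np.random.RandomState(seed)
    forest = [build_rht(X, 0, h, rng) for _ in range(t)]
    return np.array([sum(score_instance(x, tree, len(X)) for tree in forest) for x in X])
```

With t = 100 trees and h = 5 (the defaults used in Section IV), the construction cost of this sketch matches the O(n · t · d · h) bound above.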

IV. EXPERIMENTAL EVALUATION

A. Settings

Libraries: our experimental evaluation is conducted on a Linux Fedora 31 server equipped with an Intel(R) Xeon(R) CPU E5-2665 @ 2.40 GHz (32 CPUs) and 48 GB of RAM. Our code is written in Python 3 [18] and uses NumPy 1.17.4 [19] and Pandas 0.25.3 [20] for data preprocessing. The implementations of the algorithms described in Section II belong either to PyOD 0.7.9 [21] (HBOS, PPCA, K-NN, Kth-NN, OCSVM) or to scikit-learn 0.23.1 [22] (iForest and LOF). Our Python 3 implementation of RHF is available at [23].

Parameters: we set the parameters of each approach considered in our experimental evaluation following the directions of the corresponding authors. In particular, we run HBOS selecting the input parameter number of bins u [...], and we run the nearest-neighbor methods increasing the number of nearest neighbors k_nn ∈ [20, 30] and aggregating the scores. As already successfully done in [10], we use OCSVM's default parameters (rbf kernel, degree 3) with regularization parameter ν = 0.5. Finally, iForest uses its default parameters t = 100 and sample size ψ = 256, while RHF uses t = 100 and h = 5, corresponding to at most 32 leaves. Both Isolation Forest and RHF are run 10 times.

Metrics: we evaluate the performance of the algorithms by measuring the Average Precision (AP) of the precision-recall area under the curve (without interpolation): AP := Σ_n (R_n − R_{n−1}) P_n, where P_n = tp / (tp + fp) and R_n = tp / (tp + fn) are the precision and recall at the n-th threshold. We observe that the Receiver Operating Characteristic (ROC) is often employed to evaluate anomaly detection methods [9], [3]. However, it has been shown in [8] that when the classes are not balanced (as is the case with anomalies), AP curves better reflect the efficacy of an algorithm. [8] moreover shows that a curve dominates in ROC space if and only if it dominates in AP space.
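As an illustration of the metric (a toy snippet, not the paper's evaluation harness), the non-interpolated AP defined above can be computed directly or obtained from scikit-learn's average_precision_score; the two agree here because all scores in the example are distinct, so there is one threshold per instance.

```python
# Non-interpolated Average Precision: AP := sum_n (R_n - R_{n-1}) * P_n.
import numpy as np
from sklearn.metrics import average_precision_score

def average_precision(y_true, scores):
    order = np.argsort(-scores)                      # rank instances by decreasing score
    y = y_true[order]
    tp = np.cumsum(y)
    precision = tp / np.arange(1, len(y) + 1)        # P_n at the n-th threshold
    recall = tp / y.sum()                            # R_n at the n-th threshold
    return np.sum(np.diff(recall, prepend=0.0) * precision)

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])    # 1 = anomaly
scores = np.array([.10, .20, .15, .30, .22, .12, .25, .35, .90, .28])

print(average_precision(y_true, scores))             # 0.75
print(average_precision_score(y_true, scores))       # 0.75
```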
Datasets: we put a major effort into providing an extensive experimental evaluation. In particular, we include all datasets that have been considered in the literature for anomaly detection, to the best of our knowledge. This is crucial to ensure a fair comparison, since the overall results might change dramatically depending on the selection of datasets, as pointed out in [15]. We use 38 publicly available benchmark datasets ranging from 240 to 623091 instances and from 3 to 274 features. Each of them is available either at the UCI [24] or at the ODDS [25] repository. We furthermore consider the recently released WikiQOE [26] dataset, which consists of a large measurement campaign of WebQOE metrics on Wikipedia.

The KDD'99 Cup dataset is one of the most widely used benchmarks for anomaly detection. The dataset contains information about network connections, such as exchanged bytes ("source bytes", "destination bytes", etc.) and service type ("http", "smtp", "ftp", etc.). It consists of 4,898,431 instances and 41 attributes. Similarly to the filtering technique used by [27], [9], we extract 5 subsets according to the values of the service attribute (http, smtp, ftp, finger and other). Out of the 41 available attributes we select, as already done in [27], only 3 of them, namely "duration", "source bytes" and "destination bytes", as they are thought to be the most relevant ones [27]. We obtain in this way the datasets we call kdd_http (623091 instances), kdd_smtp (95554 instances), kdd_ftp (5214 instances), kdd_finger (1033 instances) and kdd_other (12844 instances). While [9] filtered the dataset according to the service attribute only, [27] filters it also by the positive logged_in attribute, as these correspond to successful attacks. We also consider this additional filter, further reducing the kdd_http dataset into the http_logged one (567498 instances) by excluding the negative values of the logged_in attribute. In order to determine to which extent the presence of duplicates might affect the overall results, we also consider a smaller version in which duplicates have been filtered out.
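The service-based extraction described above can be sketched with a few lines of Pandas. This is a hedged sketch: the file path and the column names (service, logged_in, duration, src_bytes, dst_bytes) follow the commonly distributed KDD'99 header and are assumptions, not the paper's preprocessing code.

```python
# Sketch of the KDD'99 subset extraction described above: split by service value,
# keep the three attributes used in the paper, and optionally require logged_in == 1.
import pandas as pd

kdd = pd.read_csv("kddcup.data.csv")                  # full KDD'99 Cup dataset (path is an assumption)
features = ["duration", "src_bytes", "dst_bytes"]     # "duration", "source bytes", "destination bytes"

subsets = {}
for service in ["http", "smtp", "ftp", "finger", "other"]:
    subsets[f"kdd_{service}"] = kdd.loc[kdd["service"] == service, features]

# Additional filter used for http_logged: keep only connections with logged_in == 1.
http_logged = kdd.loc[(kdd["service"] == "http") & (kdd["logged_in"] == 1), features]

# Optional: drop duplicate rows to study the effect of duplicates on the results.
kdd_http_distinct = subsets["kdd_http"].drop_duplicates()
```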
