Embedding Network Information for Machine Learning-based Intrusion Detection


Embedding Network Information for Machine Learning-based Intrusion Detection

Jonathan D. DeFreeuw

Thesis submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of

Master of Science
in
Computer Engineering

Joseph G. Tront, Chair
Randy Marchany
Yaling Yang

December 12, 2018
Blacksburg, Virginia

Keywords: intrusion detection, machine learning, word embeddings
Copyright 2018, Jonathan D. DeFreeuw

Embedding Network Information for Machine Learning-based Intrusion Detection

Jonathan D. DeFreeuw

(ABSTRACT)

As computer networks grow and demonstrate more complicated and intricate behaviors, traditional intrusion detection systems have fallen behind in their ability to protect network resources. Machine learning has stepped to the forefront of intrusion detection research due to its potential to predict future behaviors. However, training these systems requires network data such as NetFlow, which contains information about relationships between hosts but requires human understanding to extract. Additionally, standard methods of encoding this categorical data struggle to capture similarities between points. To counteract this, we evaluate a method of embedding IP addresses and transport-layer ports into a continuous space, called IP2Vec. We demonstrate this embedding on two separate datasets, CTU'13 and UGR'16, and combine the UGR'16 embedding with several machine learning methods. We compare the models with and without the embedding to evaluate the benefits of including network behavior in an intrusion detection system. We show that the addition of embeddings improves the F1-scores for all models in the multiclass classification problem given in the UGR'16 data.

Embedding Network Information for Machine Learning-based Intrusion Detection

Jonathan D. DeFreeuw

(GENERAL AUDIENCE ABSTRACT)

As computer networks grow and demonstrate more complicated and intricate behaviors, traditional network protection tools like firewalls struggle to protect personal computers and servers. Machine learning has stepped to the forefront to counteract this by learning and predicting behavior on a network. However, this learned behavior fails to capture much of the information regarding relationships between computers on a network. Additionally, standard techniques to convert network information into numbers struggle to capture many of the similarities between machines. To counteract this, we evaluate a method to capture relationships between IP addresses and ports, called an embedding. We demonstrate this embedding on two different datasets of network traffic, and evaluate the embedding on one dataset with several machine learning methods. We compare the models with and without the embedding to evaluate the benefits of including network behavior in an intrusion detection system. We show that including network behavior in machine learning models improves the performance of classifying attacks found in the UGR'16 data.

Acknowledgments

I would like to thank Dr. Tront and Professor Marchany for their patience and wisdom in the last few years as I learned what it meant to truly research and learn. Their efforts in helping me succeed cannot be overstated, and their help has certainly set me up for success in my future career.

I would like to acknowledge the love and support of my parents, Brian and Dana, my girlfriend, Jessica, and Brian and Deanne Burch. Without their positivity during my times of struggle, this thesis would not have been possible.

I want to thank the members of the IT Security Lab, particularly Ryan Kingery and Zachary Burch, for being awesome sounding boards for my thoughts throughout this entire work.

I also want to acknowledge the National Science Foundation for their funding through the CyberCorps: Scholarship for Service program.

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Research Problem
  1.2 Proposed Solution
  1.3 Thesis Outline

2 Background
  2.1 Categorical Data Representation
    2.1.1 Encoding
    2.1.2 Embedding
  2.2 Word2Vec
  2.3 Visualization
    2.3.1 PCA — Principal Component Analysis
    2.3.2 t-SNE — t-Distributed Stochastic Neighbor Embedding
    2.3.3 Comparison

3 Review of Literature
  3.1 Security Datasets
    3.1.1 DARPA
    3.1.2 KDD'99
    3.1.3 NSL-KDD
    3.1.4 CTU'13
    3.1.5 UNSW-NB15
    3.1.6 UGR'16
  3.2 Machine Learning
    3.2.1 Clustering
    3.2.2 Neural Networks
    3.2.3 Decision Trees
    3.2.4 Categorical Data Representation

4 Experimental Design
  4.1 Binned IP2Vec
    4.1.1 Choosing Word Pairs
    4.1.2 Embedding Model Design

5 Evaluation
  5.1 Binned IP2Vec
    5.1.1 CTU'13
    5.1.2 UGR'16
  5.2 Intrusion Detection
    5.2.1 Data Engineering
    5.2.2 Feature Space
    5.2.3 Supervised Learning
    5.2.4 Analysis

6 Discussion
  6.1 Future Work

7 Conclusion

Bibliography

Appendices
Appendix A Feature Importances

List of Figures

2.1 Example of an embedding of a TCP port
2.2 Comparison between categorical data representations
2.3 Sliding window for determining target and context words in word embedding
2.4 Samples of the MNIST handwriting dataset
2.5 PCA reduction on the MNIST dataset
2.6 t-SNE reduction on the MNIST dataset
4.1 Choosing word pairs in IP2Vec
4.2 Network design for IP2Vec
5.1 Graphing pipeline for IP2Vec
5.2 2D t-SNE of 32-dimensional IP2Vec embedding with min_samples 1
5.3 2D t-SNE of 32-dimensional IP2Vec embedding with min_samples 2
5.4 2D t-SNE of 32-dimensional IP2Vec embedding with min_samples 5
5.5 Rolling average loss of IP2Vec on UGR'16
5.6 t-SNE reduction of 32-dimensional IP2Vec embedding on UGR'16, with min_samples 1
5.7 t-SNE reduction of 32-dimensional IP2Vec embedding on UGR'16, with min_samples 2
A.1 Importances for Features Ranked 1-75 in Non-Embedded XGBoost
A.2 Importances for Features Ranked 76-150 in Non-Embedded XGBoost
A.3 Importances for Features Ranked 1-75 in Embedded XGBoost
A.4 Importances for Features Ranked 76-150 in Embedded XGBoost
A.5 Importances for Features Ranked 1-75 in Non-Embedded Random Forests
A.6 Importances for Features Ranked 76-150 in Non-Embedded Random Forests
A.7 Importances for Features Ranked 1-75 in Embedded Random Forests
A.8 Importances for Features Ranked 76-150 in Embedded Random Forests

List of Tables

2.1 Comparing CBOW vs skip-gram for generating word pairings
5.1 Hardware-Software Configuration
5.2 Training Statistics for IP2Vec on CTU'13
5.3 Server statistics for UGR'16
5.4 Client statistics for UGR'16
5.5 Training Statistics for IP2Vec on UGR'16
5.6 Period of attacks chosen from days in UGR'16 [14]
5.7 Non-embedded features for supervised learning
5.8 Embedded features for supervised learning, using binned IP2Vec
5.9 Evaluation metrics for XGBoost with and without IP2Vec
5.10 Confusion matrix of test set for XGBoost using non-embedded features
5.11 Confusion matrix of test set for XGBoost using features embedded with IP2Vec
5.12 Evaluation metrics for random forests with and without IP2Vec
5.13 Confusion matrix of test set for random forests using non-embedded features
5.14 Confusion matrix of test set for random forests using features embedded with IP2Vec
5.15 Evaluation metrics for MLP with and without IP2Vec
5.16 Confusion matrix of test set for MLP using non-embedded features
5.17 Confusion matrix of test set for MLP using features embedded with IP2Vec

List of Abbreviations

CART  Classification and Regression Tree
CBOW  Continuous Bag-of-Words
CNN   Convolutional Neural Network
LSTM  Long Short-Term Memory
MLP   Multi-Layer Perceptron
NIDS  Network Intrusion Detection System
NLL   Negative Log-Likelihood
NLP   Natural Language Processing
PCA   Principal Component Analysis
RNN   Recurrent Neural Network
SMOTE Synthetic Minority Oversampling Technique
SVM   Support Vector Machine
t-SNE t-Distributed Stochastic Neighbor Embedding

Chapter 1

Introduction

As networks continue to grow in complexity and traffic throughput, the tools used to monitor them for malicious behavior have struggled to keep pace. These systems, called Network Intrusion Detection Systems (NIDSs), are used to analyze the traffic on a network and assist network administrators in detecting inbound and outbound attacks.

Most NIDSs in use today, such as Snort [1] and Bro [2], rely on a corpus of known attack signatures in order to detect incoming malicious traffic. These signature-based NIDSs are extremely effective at detecting known attacks, but are inadequate at identifying novel attacks. Rather than looking for particular values within a piece of data, other systems use statistical analysis to determine whether traffic deviates from a known 'normal' behavior. Anomaly-based NIDSs generalize better to new attacks, but tend to incorrectly classify benign behavior as anomalous. Anomaly-based detection tools can be configured using data collectors such as Splunk [3] and Elasticsearch [4], generating alerts when collected data deviates from the norm.

Machine learning techniques have been explored in security research as a means to improve NIDSs. Clustering algorithms such as [5] and [6] detect outliers in the well-known security datasets DARPA [7] and NSL-KDD [8]. Neural networks, including convolutional and recurrent networks, have been developed using the same datasets, showing even higher accuracy in detecting outliers given a labeled dataset. However, while these algorithms may work well for detecting anomalies in datasets approaching 20 years old, fewer models have been trained using real-life data.

Cisco's NetFlow protocol has been used in previous work for intrusion detection and traffic classification [9]–[11] due to its efficient way of describing the behavior of a network. NetFlow consists of several important pieces of information regarding a network connection, or flow, including source and destination IP addresses, ports, protocols, and flags. The majority of data within a flow is categorical, meaning that while there are numbers to represent the data, the information inferred from the data is not easily represented in a numerical space. This makes feeding NetFlow into machine learning models difficult, due to the limited methods available for making NetFlow interpretable to a model.

1.1 Research Problem

Although machine learning has proven to be a rich source of new research for intrusion detection systems, preparing network data for machine learning models remains a difficult task. Due to the complexities of the network stack, with protocols like IP, TCP, and UDP, a significant amount of information is lost when interpreting addresses and ports as their integer representations, as is required for most learning models. If we choose to train models without features such as IP address or port, we lose out on any potential information that those features could give our models.

While a considerable amount of work has been done to analyze machine learning methods for network intrusion detection (see 3.2), a majority of that work has used synthetic datasets, meaning the data was generated in an environment designed to mimic a real-world network. This is not a preferred method, particularly if we want to explore the deployment of such intrusion detection systems in a real-world environment. Crafting realistic Internet traffic is a non-trivial issue [12], especially in the complex and protocol-rich atmosphere of today's Internet.

1.2 Proposed Solution

This thesis aims to evaluate the use of an embedding technique devised by Ring et al., named IP2Vec [13]. IP2Vec enables the encapsulation of network behavior into a machine-understandable format, called an embedding. We implement IP2Vec and modify it for use in larger networks than the original implementation, as well as for potential use in a streaming environment. We compare implementations to determine the effect of the modifications used. To gauge the effectiveness of the information gathered by the embedding, we use IP2Vec to embed the network features of NetFlow data before training supervised learning models for intrusion detection. Rather than use synthetic network data, we use the UGR'16 dataset [14]. UGR'16 includes a collection of synthetic and live traffic captured in a working enterprise environment, recorded over several months. The models are evaluated by F1-scores and confusion matrices of the test data.

1.3 Thesis Outline

This thesis is organized as follows. Chapter 2 provides background knowledge regarding the concepts utilized in the system design. Chapter 3 examines related research in the fields of security datasets and machine learning. In Chapter 4, our design is described, with the results of the evaluation in Chapter 5. We discuss future work in Chapter 6 and finally conclude the thesis in Chapter 7.

Chapter 2

Background

2.1 Categorical Data Representation

In machine learning, we refer to two types of data inputs: continuous and categorical data. Continuous data is data that has meaning when represented as a number: for example, the total number of bytes or packets, or time in seconds. This type of data is easily recognizable by learning algorithms, as functions can be made to map input to output in most cases. Categorical data refers to data that exists as a finite set of values. For instance, there are only 2^32 IPv4 addresses, so an IP address is categorical. Transport-layer protocols are also categorical (TCP, UDP, ICMP). Categorical data tends to be represented as strings, while continuous data is represented as floats or integers.

A shortcoming of most machine learning techniques is their inability to handle categorical variables directly. While algorithms such as random forests can handle categorical data, other methods such as clustering or XGBoost require modifications to the data. An example of categorical data is the source port field in a flow record. While the ports are represented as numbers (SSH: 22, HTTP: 80), models would learn relationships between ports relative to their numerical representation, not the service offered. This introduces problems when services use alternative ports, such as 8080 for HTTP. During training, we would prefer a model to learn that ports 80 and 8080 are more similar than 80 and 81.
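To make the pitfall concrete, here is a minimal sketch; the flow record and its field names are hypothetical illustrations, not taken from the thesis. Treating ports as raw integers makes HTTP (80) look far closer to the unrelated port 81 than to its common alternative 8080.

```python
# A minimal sketch of why integer-encoded ports mislead a model.
# The flow record below is hypothetical, for illustration only.
flow = {
    "src_ip": "192.168.1.10",  # categorical, despite its numeric form
    "dst_port": 80,            # categorical: identifies a service (HTTP)
    "bytes": 4096,             # continuous: magnitude is meaningful
    "duration_s": 1.2,         # continuous
}

# Numeric distance between ports says nothing about service similarity:
print(abs(80 - 81))    # 1    -> "similar" to the model, yet unrelated services
print(abs(80 - 8080))  # 8000 -> "distant", yet both commonly serve HTTP
```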

2.1.1 Encoding

The simplest way to overcome the issue of categorical data is to use one-hot encoding. This converts discrete data into sparse vectors where, in an encoding of length n, there are n - 1 zeros and a single 1. The '1' value marks the data that we are trying to encode. In the encoding vector, there is an index for each unique value within the range of the feature. This means that to encode a source port, our vector must have a length of 65536.

Because the vector contains a single index for each unique value, encoding becomes inefficient when there is a large set of values, especially a large number of rare and underutilized values. To combat this, we can condense the feature vector by only creating encodings for the most frequent values within the feature's unique values. For example, when encoding TCP/UDP ports, we can reserve encodings for common ports like 22 (SSH), 443 (HTTPS), and 3389 (RDP). For all other ports, a single index is reserved in the vector. When encoding one of these 'other' ports, this value is 1 and all other values in the vector are 0. We refer to this method of encoding as binned one-hot encoding.
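A minimal sketch of binned one-hot encoding as just described; the particular set of common ports and the helper name binned_one_hot are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

# Hypothetical set of frequent ports to keep; everything else is binned.
COMMON_PORTS = [22, 80, 443, 3389]  # SSH, HTTP, HTTPS, RDP
OTHER_INDEX = len(COMMON_PORTS)     # last slot catches all remaining ports

def binned_one_hot(port: int) -> np.ndarray:
    """Encode a port as a length-5 vector: one slot per common port,
    plus a single shared 'other' slot."""
    vec = np.zeros(len(COMMON_PORTS) + 1)
    if port in COMMON_PORTS:
        vec[COMMON_PORTS.index(port)] = 1.0
    else:
        vec[OTHER_INDEX] = 1.0
    return vec

print(binned_one_hot(443))   # [0. 0. 1. 0. 0.]
print(binned_one_hot(8080))  # [0. 0. 0. 0. 1.] (binned with all rare ports)
```

Note how 8080 collapses into the same shared slot as every other uncommon port; this is precisely the information loss that motivates the embedding approach in the next section.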

2.1.2 Embedding

While binned one-hot encoding reduces the vector length for our port problem, it significantly reduces the amount of information that can be learned for all of the words in the 'other' category. One-hot encoding becomes infeasible for other categorical variables, particularly words in Natural Language Processing (NLP). For example, if we used binned one-hot encoding on the 100 most frequently used TCP/UDP ports, we would lose context on the remaining 65436/65536 ≈ 99.85% of usable ports.

Instead, the preferred technique for representing large numbers of unique values is called embedding. We refer to this collection of unique values as a vocabulary in the context of embeddings. To create an embedding, we attempt to predict an output word given an input word, generating a dense weight matrix (meaning most attributes are non-zero) for the entire vocabulary. This results in a matrix of size n × m, where n is the size of the vocabulary and m is the size of our embedding. Each row represents a single word, accessed by multiplying a one-hot encoding vector by the weight matrix. Figure 2.1 shows an example of an HTTP port being embedded. The embedding can be trained independently from other models, or integrated into larger models as a preprocessing layer.

[Figure 2.1: Example of an embedding of a TCP port. A one-hot encoding over a port vocabulary (ssh, ftp, http, rdp, smtp) is multiplied by the weight matrix to produce the dense embedding for http.]
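The row lookup described above can be written directly. This is a minimal NumPy sketch under assumed sizes (a five-word port vocabulary, embedding dimension 3, random weights standing in for trained ones), not the thesis's actual IP2Vec weights:

```python
import numpy as np

# Assumed toy vocabulary of ports and an embedding dimension of 3.
vocab = ["ssh", "ftp", "http", "rdp", "smtp"]  # n = 5 words
n, m = len(vocab), 3
rng = np.random.default_rng(0)
W = rng.normal(size=(n, m))  # n x m weight matrix, learned during training

# One-hot encode "http", then multiply by W to pull out its embedding row.
one_hot = np.zeros(n)
one_hot[vocab.index("http")] = 1.0
embedding = one_hot @ W      # equivalent to W[vocab.index("http")]

print(embedding)             # dense m-dimensional vector for "http"
```

In practice, frameworks implement this lookup as a direct row index rather than a full matrix multiplication; the multiplication form is shown here because it matches Figure 2.1.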
