Machine-Learning-based Analysis Of Program Binaries: A .

2y ago
28 Views
3 Downloads
7.40 MB
25 Pages
Last View : 1m ago
Last Download : 2m ago
Upload by : Jamie Paz
Transcription

Digital Object IdentifierMachine-Learning-based Analysis ofProgram Binaries: A ComprehensiveStudyHONGFA XUE* , (Student Member, IEEE), SHAOWEN SUN* , GURU VENKATARAMANI,(Senior Member, IEEE) and TIAN LAN, (Member, IEEE)The George Washington University, Washington, DC, 20052 (e-mail: {hongfaxue, sunshaowen, guruv, tlan}@gwu.edu)Hongfa Xue and Shaowen Sun contributed equally to this work.*Corresponding author: Hongfa Xue (e-mail: hongfaxue@gwu.edu).ABSTRACT Binary code analysis is crucial in various software engineering tasks, such as malwaredetection, code refactoring and plagiarism detection. With the rapid growth of software complexity andthe increasing number of heterogeneous computing platforms, binary analysis is particularly critical andmore important than ever. Traditionally adopted techniques for binary code analysis are facing multiplechallenges, such as the need for cross-platform analysis, high scalability and speed, and improved fidelity,to name a few. To meet these challenges, machine learning-based binary code analysis frameworks attractsubstantial attention due to their automated feature extraction and drastically reduced efforts needed onlarge-scale programs. In this article, we provide a taxonomy of machine learning-based binary code analysis,describe the recent advances and key findings on the topic, and discuss the key challenges and opportunities.Finally, we present our thoughts for future directions on this topic.INDEX TERMS Machine learning, Program binary analysis, Taxonomy.I. INTRODUCTIONINARY code analysis (BCA) allows software engineersto directly analyze binary executables without access tosource code. It is widely used in various domains where thereis limited availability of source code, e.g., due to proprietaryissues or simply impossible to trace any source code. Today,BCA has become more important than ever due to legacyprograms that have been installed in a variety of environments, including the Internet of Things (IoT). It is currentlyestimated that there are more than 20 billion IoT devicesworldwide by 2020 [67]. BCA can be useful for IoT andother mission-critical environments (e.g., defense, hospitals)and provide key tools for improving software security in suchplaces [56], [115], [129].BNote that it is difficult to directly analyze binary executables when compared to program source code. First, it ischallenging, if not impossible at all, to recover the originalsource code or semantic information from the representationof binary code. Second, commercial software and operating systems are usually slightly obfuscated to deter reverseengineering and unlicensed use. On the other hand, systemand kernel libraries are often optimized to reduce disk spacerequirements. For instance, it may be difficult to locateVOLUME 4, 2016function entry points (FEPs) since the full symbol or debuginformation is usually not available in optimized binaries.Recently, machine learning techniques have been employed to automatically extract features through largeamounts of data and have achieved significant success in thefield of source code analysis [5], [117], [142], [144]. Inspiredby those prior works on the source code, machine learningbased BCA has also been widely studied as well. Tesauroet al. [123] first introduced a neural network based methodfor recognizing virus in application binaries. Since then, machine learning-based BCA has become a significant researchtopic in vulnerability detection, function recognition, andother areas.In this article, we discuss several aspects of binary codeanalysis and outline the machine learning algorithms usedin such analyses. We provide a taxonomy of machinelearning-based binary code analysis techniques and discussthe key challenges and opportunities. Finally, we present ourthoughts for future directions on this topic.II. BACKGROUNDBCA is a key requirement for many software engineeringtasks that include:1

pushmovmovmovmovmovmovsarcmpjzebxeax, [esp 4 arg 0]edx, [eax 58h]ebx, [edx 344h]edx, [eax]eax, [ebx 24h]ecx, edxecx, 8ecx, 3short loc 80B6412cmpjleedx, 302hshort loc 80B640fpop ebxretncmp edx, 302hmov edx, 20080hcmovz eax, edxpop ebxretnA. x86 assemblylwlwlwsralibnelw v0, 0x47( a0) v1, 0( a0) v0, 0x344( v0) a1, v1,6 a0, 2 a1, a0, locret 19830 v0, 0x31( v0)slti v1, 0x303bnez v1, locret 19830li v1, 0xC030bne v0, v1, locret 19830nopla v0, loc 20080jr ranopB. MIPS assemblyFIGURE 1. The control flow graph comparison for the vulnerable function CVE-2013-6449 under different architectures. [46] vulnerability discovery (including integer, heap or stackbuffer overflow, denial-of-service attack, Input manipulation, authentication bypass vulnerabilities)refactoring (a restructuring process of changing a program binary to creates new versions that implement orpropose change to the program binary while preservingthe program’s external behavior (functionality and semantics))plagiarism detection (detecting a plagiarized program,which is a program that has been generated from another program with trivial text edit operations).Especially with the rapid growth of IoT devices and thecomplexity of software applications, program binaries are often shared among multiple platforms. Imagine a single bug isinjected at source code level, it may spread across thousandsor more devices that have diverse hardware architecturesand software platforms. Thus, binary analysis is particularlycritical and more crucial than ever.BCA is used to provide information about a program’scontent (instructions, basic blocks, functions, and modules),structure (control and data flow), and data structures (globaland stack variables), and is, therefore, a foundation of manysecurity applications. Code Clone Detection in binaries seeksto find code sequence used more than once, copy/paste orreused code, or same source code but compiled under different Instruction Set Architectures [19], [66], [110], [150].2Malware detection detects malicious programs which havevulnerabilities inside and will cause damage to systems orprograms crashes [10], [12], [18], [32], [36], [41], [58], [82],[126], [127], [146]. Code obfuscation translates the originalprogram into another one preserving its function but makingit hard for analysis [74], [86], [114], [119], [135]. Binaryreverse engineering translates the binary program into highlevel readable language, such as converting binaries backto source code [29], [79], [85], [118]. Binary customization directly changing the binary program through binaryrewriting, removing unused or vulnerable codes/functionsin program binaries [25], [26]. Recent studies have shownthat software can manipulate hardware features as well to domalicious activities like information leakage [2], [145], [147]and memory corruption [23], [24], [119]. Such studies callfor a more thorough understanding of software binaries [95],[128] and assessing their potential to be exploited in securityattacks.A. KEY CHALLENGESLegacy program binaries exist in many production systems,e.g., airspace, military, and banking. To detect bugs or analyze software system safety, executable binary codes arethe only source of information about program content andbehavior. The compile, link, and optimize steps can cause aprogram’s detailed execution behavior to differ substantiallyfrom its source code. Although it is easy to compile sourceVOLUME 4, 2016

codes into binary executables, it is hard to reverse the binarycodes back to source codes for further analysis. Often, it isrelatively better to have binary executables be analyzed intointermediate representations (e.g., binary assembly code),which consist of limited code syntax and semantic featurescomparing to the source code.We note that the source code can be compiled in differentplatforms (e.g. x86, MIPS). This leads to the syntactic binaryrepresentations to be very different for the same programcompiled on two platforms, bearing very different structures.Such cross-platform binary code analysis problems havebeen tackled, only recently [43], [47]. These efforts useframeworks to extract various robust platform-independentfeatures directly from binary code. We will discuss moredetails in Section VI.Example. A vulnerable function in the source code maypropagate into hundreds or more IoT devices that have diverse hardware architectures and software platforms afterbeing compiled using the same source code. For instance,a real-world denial-of-service attack vulnerability CVE2013-6449 has completely different control flows whencompiled under x86 and MIPS architecture as shown inFigure 1, even though they both are compiled from thesame source code from OpenSSL library (version 1.0.1a).To deal with these challenges, the traditional approachesfor BCA are mainly through pure statistical methods or pureformal analysis (e.g., binary symbolic execution). Statisticalmethods depend on logging of program states for analysis. Inreality, the program execution often yields partial/incompletelogs, it becomes extremely hard for statistical methods aloneto accurately achieve the goal through binary code analysis.For example, in vulnerability discovery, it could easily missvulnerable paths (false negatives) due to inadequate statisticalprofile data and low code analysis coverage. On the otherhand, pure formal analysis can guarantee no false positiveswith a better analysis accuracy and significantly improvecode coverage. However, it is always a time-consuming approach and cannot be deployed in large scale programs dueto exponential state increase (path explosion) problem.B. GENERAL FRAMEWORKTo overcome the difficulty of processing binary codes andperform the code analysis in an automated fashion, Machine Learning techniques have been adopted. Several machine learning-based code analysis frameworks have beendeployed at the source code level that adapt natural languageprocessing (NLP) techniques at source code for different purposes. For example, White et al. [133] introduce a learningbased framework for code similarity detection in the sourcecode, where Recursive Neural Network (RNN) is deployedto profile code sequences.A machine learning-based BCA framework has generallytwo stages: the training stage and the analysis stage. Thegeneral framework of machine learning based binary codeanalysis is shown in Figure 2. In the training stage, the targetVOLUME 4, 2016binary code will be analyzed first and unique features will beextracted from the analysis results. Then, the Machine learning algorithm will be trained with input features and desiredbinary code knowledge. In the analysis stage, given a targetbinary code, features will be extracted after program analysis(e.g. lexical analysis and syntax analysis). According to thetraining result, machine learning models can be used toidentify such features to achieve certain analyzed goals suchas discovering vulnerable paths, finding performance bugsand imbalance and so on.C. THE TAXONOMYAccording to the general framework shown in Section II-B,existing machine learning-based BCA systems can bebroadly divided into four major components. In Figure 3,we present the taxonomy of machine learning-based BCAframework in detail. The remaining sections of this paperdiscuss the various existing methods from the perspectiveof feature extraction, feature embedding, analysis techniquesused in BCA and corresponding applications in the realworld.1) Feature ExtractionDifferent from applying machine learning in NLP tasks,optimized program binary codes contain a huge vocabulary block of codes (e.g. basic blocks) that have complexrelationships and less redundancy in terms of syntax andsemantic information compared to human languages or eventhe original program source codes. The first requirementof machine learning-based BCA is to extract features thatrepresent binary code. Here, we further divide the featuresextracted from program binaries into different categories.1. Graph-based feature. Program Binaries can be represented as a program flow graph, such as control flowgraph (CFG). We can use such graph to extract graphbased features [43], [47], [65], [136].2. Code-based feature. Here, we extract the features directly from raw binary code. We summarize codebased feature into two different levels: token-level andinstruction-level.2.1. Token-level features. Tokens (e.g., words, characters,or symbols) from binary code to extract tokens-basedfeature using decompiler or binary disassembly tools,such as IDA Pro [102], OllyDbg [148] and capstone [90].2.2. Instruction-level features. Every machine-level instruction in binary executables is a combination oftokens, such as memory references, registers, and immediate values. Such instructions sequence can alsobe used as program features for analyzing programbugs, information flow and so on [95], [126].2) Feature EmbeddingThe extracted raw features from binaries cannot be fed intomachine learning modules directly since machine learning3

Training StageTraining setBinaryAnalysisDesired binary ithmTrainingAnalysis StageTargetBinaryAnalysisFeatureExtractionMachine LearningModelsResultFIGURE 2. General framework of machine learning based binary code analysis.needs numerical data as inputs (e.g. vectors). Thus, a generalphase is to transfer features to feature vectors or some formalrules, this phase is commonly named as feature embedding.1. Graph Embedding. Graph embedding networks havebeen proposed for classification domains such asmolecule classification [35], which we can covert graphbased features into graphs.2. Code Embedding. Based on existing sequence-tosequence models (e.g., Recursive Neural Network),These and related models are well-suited to tasks thathave tokens or instructions as inputs and embed theminto vector space [72], [103], [120].3) Analysis TechniquesAfter feature extraction and feature embedding, we needto choose a suitable machine learning algorithm for furtherprogram analysis. Existing analysis techniques can be classified into three categories: supervised learning, unsupervisedlearning, and deep learning. We will discuss each of them indetail in Section V.4) ApplicationsBCA for malware detection is a widely developed area ofresearch. Rieck et al. [103] introduced Malware behavioranalysis based on machine learning. Liangboonprakong etal. [83] create a method to classify Malware families usingMachine learning. Rosenblum et al. [105] recover tool chainprovenance to record the compilation-related information.discovRE [43] and Genius [47] develop tools that can identifybugs automatically via clustering the already known vulnerabilities.4The remainder of this article is organized as follows. InSection III and Section IV, we discuss the overall principlesand strategies of feature extraction and feature embeddings.Sections V address the commonly used machine learningmodels that are deployed in binary code analysis and theywill be further compared, while Section VI discusses theapplication of machine learning based code analysis for realworld problems. Section VII discusses how recent advancesin other areas could be applied to enhance binary codeanalysis. Concluding remarks are presented in Section VIII.III. FEATURE EXTRACTIONIn this section, we list the various types of features that canbe extracted from binaries. In general, the goal of featureextraction is to automatically link binary code patterns minedat the lexical level with patterns mined at the syntactic level.A. GRAPH-BASED FEATURESThe Program flow graph (e.g., control flow graph, datadependency graph) is the common feature used in variousapproaches of BCA. Especially for the cross-platform bugsearch problem, the program flow graph and the basic blockmargins typically remain equivalent (or at least similar) incross-compiled code. Thus, such graph-based features areadaptive by design and with high efficiency for large-scaleBCA applications.1) Abstract Syntax TreeAbstract Syntax Tree (AST) is typically used by compilersto represent the structure of program code, and to analyzethe dependencies between variables and statements. ManyVOLUME 4, 2016

Abstract SyntaxControl Flow GraphGraph-based FeaturesFeature ExtractionToken-level FeaturesTokensN-gramPrintable StringsPortable ExecutableInstruction-level FeaturesFeature Entry PointIdiomHexdecimal Byte SequencesThe Malware Instruction SetInstructionsCode-based FeaturesFeature EmbeddingMachine Learning-basedBCAAnalysis TechniquesApplicationsGraph EmbeddingCode EmbeddingFrequency EmbeddingByte EmbeddingInstruction Q-gram EmbeddingSupervised LearningSupport Vector MachineK-nearest neighborDecision TreeRandom ForestBoostingBayers classifierUnsupervised LearningClusteringRule-based LearningWeighted Prefix TreeDeep LearningDeep Neural NetworkRecurrent Neural NetworkMultiple Layer PerceptronOne-Sided PerceptronCryptographic Algorithm ClassificationRecovering Toolchain ProvenanceBinary Code Clone DetectionDecompilationFunction RecognitionAuthorship RecognitionMalware DetectionVulnerability DiscoveryFIGURE 3. The taxonomy of machine learning-based BCA framework.works adopt AST in source code level syntax extractionanalysis [20], [34], [72], [76], [93].AST can also be used in machine learning-based BCA. Forinstances, a function recognition tool, FID [130], first translates each instruction within a basic block into assignmentformulas which can represent the data flow exchanging andthe semantics of each basic block. Then, each data movement between registers and/or memory (assignment) will betranslated into a syntax tree, which can be further convertedinto a numerical vector by calculating the maximum levels ofnested parentheses and the maximum depth of an AST as twosyntactic features.Example. As shown in Figure 4, (A) shows a functionentry point basic block and its corresponding data movement operations (register/memory copy, assignments andcomputations), while (B) shows memory access dependency behavior between caller’s and callee’s basicblocks.2) Control Flow GraphThe concept of Control Flow Graph (CFG) is first introducedby Frances et al. [4]. CFG represents a graph of all possibleexecution paths that might be traversed through a programduring its execution. To construct a basic static binary CFG,function entry and exit point should be found first. In afunction, a sequence of consecutively executing instructions(defined as nodes) and control flow transfers between suchVOLUME 4, i 0x4c, %eax0x68(%esp), %eax0x64(%esp), %edx0x60(%esp), %esi%edx, %edxmovmovmovcall 0x805248e, 0x4(%esp)-0x10(%ebp), %eax%eax,(%esp)c strcasecmp c strcasecmp :push%ebpmov%esp, %ebppush%esipush%ebxsub%0x20, %espmov0x8(%ebp), %esimov0xc(%ebp), %ebxcmp %ebx, %esijne0x804cf85eaxebxecxedxesiediebpesp mem1 reg2 reg3 mem2 mem3 reg6 reg7 reg8 - 92eaxebxecxedxesiediebpesp mem1 reg2 reg3 reg4 reg5 reg6 reg7 reg8A. stack register adjustments in afunction entry pointB. Memory access behaviors in aninter-procedure control transfer.eax reg1ebx mem1ecx reg3edx reg4esi mem2edi reg6ebp reg8 - 4esp reg8 – 44[reg8 4] mem3[reg8 8] mem4FIGURE 4. Instruction and its corresponding assignment formulas. [130]sequences via jumps, function calls and returns (defined asedges). A binary CFG is constructed to capture the relationship between nodes and edges in the underlying binaryprogram. It is a useful method to reflect internal relationsbetween basic blocks. For this purpose, plenty of works havediscussed CFG extraction from binary codes as programfeatures, such as Theiling et al. [124], Kinder et al. [75],Kruege et al. [77], Yadegar [141].5

entryblk1blk2CFG unigrams:blk1blk2blk4.blk3blk3CFG bigrams:cpuidjmp L2 L1:cmp ecx,edxjle L1L2:mov eax, A. code exampleB. GraphletexitA.Control-flow graphB.Control-flow featuresFIGURE 6. Example code and its corresponding grapglet. [107]FIGURE 5. Control Flow Features.

Code Clone Detection in binaries seeks to find code sequence used more than once, copy/paste or . level readable language, such as converting binaries back to source code [29], [79], [85], [118]. Binary customiza- . Such cross-platform binary code analysis problems have been tackled, only recently [43], [47]. These efforts use

Related Documents:

decoration machine mortar machine paster machine plater machine wall machinery putzmeister plastering machine mortar spraying machine india ez renda automatic rendering machine price wall painting machine price machine manufacturers in china mail concrete mixer machines cement mixture machine wall finishing machine .

Machine learning has many different faces. We are interested in these aspects of machine learning which are related to representation theory. However, machine learning has been combined with other areas of mathematics. Statistical machine learning. Topological machine learning. Computer science. Wojciech Czaja Mathematical Methods in Machine .

Machine Learning Real life problems Lecture 1: Machine Learning Problem Qinfeng (Javen) Shi 28 July 2014 Intro. to Stats. Machine Learning . Learning from the Databy Yaser Abu-Mostafa in Caltech. Machine Learningby Andrew Ng in Stanford. Machine Learning(or related courses) by Nando de Freitas in UBC (now Oxford).

Machine Learning Machine Learning B. Supervised Learning: Nonlinear Models B.5. A First Look at Bayesian and Markov Networks Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL .

with machine learning algorithms to support weak areas of a machine-only classifier. Supporting Machine Learning Interactive machine learning systems can speed up model evaluation and helping users quickly discover classifier de-ficiencies. Some systems help users choose between multiple machine learning models (e.g., [17]) and tune model .

Artificial Intelligence, Machine Learning, and Deep Learning (AI/ML/DL) F(x) Deep Learning Artificial Intelligence Machine Learning Artificial Intelligence Technique where computer can mimic human behavior Machine Learning Subset of AI techniques which use algorithms to enable machines to learn from data Deep Learning

2. Machine Learning Workflow By Steps 3. Four Groups Of Task That Machine Learning Solves 3.1 Classification 3.2 Cluster analysis 3.3 Regression 3.4 Ranking 3.5 Generation 4. Three Model Training Styles 4.1 Supervised learning 4.2 Unsupervised learning 4.3 Reinforcement learning 5. Embarking On Machine Learning 5.1 "Business translator" and .

a) Plain milling machine b) Universal milling machine c) Omniversal milling machine d) Vertical milling machine 2. Table type milling machine 3. Planer type milling machine 4. Special type milling machine 5.2.1 Column and knee type milling machine The column of a column and knee