YARA-Signator
Automated Generation of Code-based YARA Rules
2019-12-06, Botconf, Bordeaux
Felix Bilstein (@fxb_b), Daniel Plohmann (@push_pnx)
Cyber Analysis and Defense Department, Fraunhofer FKIE

Introduction
- Felix Bilstein: Student @ University of Bonn, Research Assistant @ Fraunhofer FKIE
- Daniel Plohmann: Reverse Engineering & Malware Research @ Fraunhofer FKIE

Agenda
- Motivation
- Approach
- Evaluation
- Future Work

Motivation

Motivation: YARA
- Classification of malware is essential for effective processing, e.g.
  - Analysis speedup through contextualization
  - Automated extraction of configurations
- YARA is a de-facto standard: a highly efficient pattern matching engine and rule language
  - "Easy to learn, hard to master"
  - Effective rule creation is still "expensive" (experience, expert knowledge)

Motivation: Rules, State of the Art
- Currently, most publicly available rules are composed mainly of (text) strings:
  - Rule sets: Mike Worth [1], Florian Roth [2], YaraRules [3], deadbits [4], [redacted]
  - Files: 2,516, Rules: 26,515
  - 73,295 (75.25%) text strings, 23,367 (23.99%) bytes, 736 (0.76%) regex
- Potential advantages of code-based (byte) rules:
  - Typically robust when targeting the "right" code areas
  - Harder to circumvent by malware authors(?)
  - Automation scales better than manual effort :)
[1]
[2] https://github.com/Neo23x0/signature-base
[3] https://github.com/Yara-Rules/rules
[4] https://github.com/deadbits/yara-rules

Motivation: Automated Rule Generation / Related Work
- Limited number of tools for automated rule generation:
  - Blichmann: vxsig [1] / Zaddach & Graziano: BASS [2]
  - Roth: yarGen [3]
  - Doman: YaBin [4]
[1] https://github.com/google/vxsig
[2] https://github.com/Cisco-Talos/BASS
[3] https://github.com/Neo23x0/yarGen
[4] https://github.com/AlienVault-OTX/yabin

Motivation: YARA-Signator
- Practical usage example of the data contained in Malpedia [1]:
  - Started as BA thesis [2,3], continued as MA lab
  - Automated creation of YARA rules!
[1] https://malpedia.caad.fkie.fraunhofer.de
[2] http://cocacoding.com/papers/Automatic Generation of code based YARA Signatures.pdf
[3] https://github.com/fxb-cocacoding/yara-signator

Approach

Approach: Objectives
- Goal: Generate accurate YARA rules for as many families in Malpedia as possible
  - Automate YARA string (i.e. byte sequence) selection procedure
  - Quality assurance
- Method: Disassembly - Shingling - Aggregation
- Result: YARA-Signator [1]
[1] https://github.com/fxb-cocacoding/yara-signator

Approach: Modular Procedure
- Disassemble all unpacked/dumped samples in Malpedia using SMDA [1]; all subsequent steps operate on this disassembly.
[1] https://github.com/danielplohmann/smda

Approach: Modular Procedure
- SMDA produces a disassembled ICFG in JSON, but YARA matches linear byte sequences.
- Therefore: linearization and tokenization into Ngrams of 4/5/6/7 instructions.
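
The tokenization step can be illustrated with a sliding window over an already linearized instruction sequence. This is only a sketch: the function name, the input representation (one hex string per instruction), and the fifth example instruction are assumptions for illustration, not taken from YARA-Signator itself.

```python
from typing import List, Sequence, Tuple


def instruction_ngrams(instructions: List[str],
                       sizes: Sequence[int] = (4, 5, 6, 7)) -> List[Tuple[str, ...]]:
    """Slide a window of each size over a linear instruction sequence."""
    ngrams = []
    for n in sizes:
        # range() is empty when the sequence is shorter than n
        for start in range(len(instructions) - n + 1):
            ngrams.append(tuple(instructions[start:start + n]))
    return ngrams


# Hex-encoded bytes of five consecutive instructions (hypothetical input).
linear = ["8b0da4724100", "8b5dfc", "50", "e8c1feffff", "85c0"]
grams = instruction_ngrams(linear)
for gram in grams:
    print(" ".join(gram))
```

With five instructions, only the window sizes 4 and 5 produce Ngrams; longer code sequences yield Ngrams for all four sizes.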

Approach: Modular Procedure
- Unpacked/dumped samples are typically affected by relocations.
- Wildcard those addresses for better rule generalization.
- Targets: CFG instructions and data pointer references.

    $sequence_wildcarding_disabled = { 8b0da4724100 8b5dfc 50 e8c1feffff }
    // 8b0da4724100    mov ecx, dword ptr [0x4172a4]
    // 8b5dfc          mov ebx, dword ptr [ebp - 4]
    // 50              push eax
    // e8c1feffff      call 0xfffffec6

    $sequence_wildcarding_cfg_ptr = { 8b0d???????? 8b5dfc 50 e8???????? }
    // 8b0d????????    mov ecx, dword ptr [?]
    // 8b5dfc          mov ebx, dword ptr [ebp - 4]
    // 50              push eax
    // e8????????      call ?
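
A minimal sketch of the wildcarding idea, using the example sequence from this slide. It assumes the disassembler already reports how many trailing bytes of each instruction hold an address; the `(hex, address_bytes)` tuple format and both function names are hypothetical. YARA hex strings use `??` per wildcarded byte, so a 4-byte address becomes `????????`.

```python
from typing import Iterable, Tuple


def wildcard(hex_bytes: str, address_bytes: int) -> str:
    """Replace the trailing `address_bytes` bytes with YARA wildcards."""
    if address_bytes == 0:
        return hex_bytes
    keep = len(hex_bytes) - 2 * address_bytes  # two hex digits per byte
    return hex_bytes[:keep] + "??" * address_bytes


def to_yara_hex(instructions: Iterable[Tuple[str, int]]) -> str:
    """Render (hex, address_bytes) pairs as one YARA hex string."""
    return "{ " + " ".join(wildcard(h, n) for h, n in instructions) + " }"


# (instruction bytes, number of trailing address bytes to mask)
sequence = [("8b0da4724100", 4), ("8b5dfc", 0), ("50", 0), ("e8c1feffff", 4)]
print(to_yara_hex(sequence))
```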

Approach: Modular Procedure
- The actual "magic": data deduplication
- Keep only those Ngrams that occur in a single family.
- Table: Percentage of unique Ngrams, aggregated on family level [1] (table contents not preserved)
[1] http://cocacoding.com/papers/Automatic Generation of code based YARA Signatures.pdf
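
The filtering rule (keep an Ngram only if it occurs in exactly one family) can be sketched in memory as follows. The family names and Ngram labels are made up for illustration, and the real pipeline works at a much larger scale with Postgres rather than Python dictionaries.

```python
from collections import defaultdict
from typing import Dict, Set


def family_unique_ngrams(ngrams_by_family: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, per family, the Ngrams that no other family contains."""
    owners = defaultdict(set)
    for family, ngrams in ngrams_by_family.items():
        for ngram in ngrams:
            owners[ngram].add(family)

    unique = defaultdict(set)
    for ngram, families in owners.items():
        if len(families) == 1:
            (family,) = families  # exactly one owner
            unique[family].add(ngram)
    return dict(unique)


# Hypothetical input: "shared" occurs in two families and is dropped.
corpus = {
    "win.family_a": {"a1", "a2", "shared"},
    "win.family_b": {"b1", "shared"},
}
print(family_unique_ngrams(corpus))
```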

Approach: Modular Procedure
- Among candidate Ngrams per family, select those with best coverage.
- Require 7 sequences per sample.
- Compose and validate YARA rules.
  - First round: Remove FP sequences ("any of them").
  - Second round: Increase coverage by selecting more sequences.
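
The slides do not spell out how "best coverage" is computed. One plausible reading is a greedy selection that keeps picking the Ngram covering the most samples that still lack the required number of sequences. The sketch below follows that assumption; the function name and data layout are hypothetical, and the demo uses `required=2` instead of 7 to stay small.

```python
from typing import Dict, Iterable, List, Set, Tuple


def select_sequences(coverage: Dict[str, Set[str]],
                     samples: Iterable[str],
                     required: int = 7) -> Tuple[List[str], Dict[str, int]]:
    """Greedy pick; `coverage` maps each Ngram to the samples containing it."""
    need = {sample: required for sample in samples}
    remaining = dict(coverage)
    chosen = []

    def gain(ngram: str) -> int:
        # Number of samples this Ngram would still help
        return sum(1 for s in remaining[ngram] if need.get(s, 0) > 0)

    while remaining and any(n > 0 for n in need.values()):
        best = max(remaining, key=gain)
        if gain(best) == 0:
            break  # no remaining Ngram helps any under-covered sample
        for s in remaining.pop(best):
            if need.get(s, 0) > 0:
                need[s] -= 1
        chosen.append(best)
    return chosen, need


candidates = {"g1": {"a", "b"}, "g2": {"a"}, "g3": {"b"}, "g4": {"a", "b"}}
picked, missing = select_sequences(candidates, ["a", "b"], required=2)
print(picked, missing)
```

Samples with a non-zero entry in the returned `missing` dictionary could not reach the required number of sequences with the available candidates.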

Approach: Modular Procedure
- Example output:

    rule win_corebot_auto {
        meta:
            author = "Felix Bilstein - yara-signator at cocacoding dot com"
            description = "autogenerated rule brought to you by yara-signator"
            tool = "yara-signator 0.2a"
            malpedia_version = "20190620"
            malpedia_license = "CC BY-SA 4.0"
            malpedia_sharing = "TLP:WHITE"
        strings:
            $sequence_0 = { 7410 85c0 740c 50 }
            $sequence_1 = { 895e0c 8b03 894604 83e880 }
            [...]
            $sequence_9 = { 8b00 894614 83e880 894618 }
        condition:
            7 of them
    }
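
Composing the rule text from selected sequences is mostly string formatting. The sketch below renders a rule in the shape shown on this slide; `render_rule` is a hypothetical helper, the meta section is omitted for brevity, and capping the threshold at the number of available sequences is an assumption made here so that short rules remain satisfiable.

```python
from typing import List


def render_rule(name: str, sequences: List[str], required: int = 7) -> str:
    """Render hex sequences as a YARA rule with an 'N of them' condition."""
    threshold = min(required, len(sequences))
    lines = [f"rule {name} {{", "    strings:"]
    for index, sequence in enumerate(sequences):
        lines.append(f"        $sequence_{index} = {{ {sequence} }}")
    lines += ["    condition:", f"        {threshold} of them", "}"]
    return "\n".join(lines)


rule_text = render_rule(
    "win_corebot_auto",
    ["7410 85c0 740c 50", "895e0c 8b03 894604 83e880"],
)
print(rule_text)
```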

Approach: Implementation & Performance
- Implementation as modular framework using: Java, Postgres, YARA
- Performance (full run on the data set explained in the following):
  - Hardware: Intel i7, 32GB RAM, 1 HDD SSD

Step                                           Unmasked (hours)   Wildcarded (hours)
Parsing, Linearization, Ngrams, Wildcarding    6.5                5.5
Filtering                                      2.5                2
First Round                                    4                  2
Following Rounds                               1                  1
Total                                          14                 10.5

Evaluation

Evaluation: Data Sets
- Malpedia [1]
  - Curated, free, high-quality malware corpus for research
  - Snapshot: 2019-10-21 09:13:52 (commit: d006d14)
- empty msvc [2]
  - Empty Visual Studio projects for all versions (VS6-VS2019), built with different bitness and compiler settings
  - Ground truth for the most common statically linked code
[1] https://malpedia.caad.fkie.fraunhofer.de
[2] https://github.com/danielplohmann/empty msvc

Evaluation: Code Statistics
- Malpedia [1] (commit: d006d14, date: 2019-10-21)

                          Families   Samples
Total                     1,447      4,237 (8,508 files)
Processable (unpacked)    1,085      3,159 (4,575 files)
Detectable                949        2,916 (3,978 files)

- Code Statistics: Averaged over 157,806,663
[1] https://malpedia.caad.fkie.fraunhofer.de

Evaluation: Ngram Statistics
- Building Ngrams of length 4-7 instructions
- Data reduction through Ngram aggregation:
  - Already uniquified per sample while parsing

              Raw            Distinct (over all samples)   Aggregated (occurrence in one family)
Wildcarded    476,110,027    191,035,382                   170,868,100

- Observations:
  - There are significantly fewer unique wildcarded Ngrams than unmasked ones
  - Family-based code isolation leaves a far larger pool of Ngrams than initially expected

Evaluation: Rule Statistics
- YARA-Signator output: 949 rules
- Total sequences in all rules: 11,825
  - Wildcarded: 5,765 (48.75%)
- 736/949 rules (77.56%)

Sequences per rule:    Min 2    25% 10    50% 10    75% 10    Max 235
Bytes per sequence:    Min 4    25% 14    50% 18    75% 23    Max 70

    rule yara-signator {
        meta:
            description = "rule statistics"
        strings:
            $sequence_0 = { 7410 85c0 740c 50 }
            $sequence_1 = { 895e0c 8b03 894604 e8???????? 83e880 }
            [...]
            $sequence_9 = { 8b00 894614 83e880 894618 }
        condition:
            7 of them
    }

Evaluation: Classification Performance

                 Families   Samples
Total            1,447      4,237 (8,508 files)
Processable      1,085      3,159 (4,575 files)
Detectable       949        2,916 (3,978 files)

- Rule performance:
  - Rules without FPs: 924
  - Rules without FNs: 844
  - "Clean" rules: 840
- Stats:
  - PPV / Precision: 0.987
  - TPR / Recall: 0.945
  - F1: 0.966
- Reasons for False Positives:
  - Disassembly inaccuracies
  - Ground truth / labeling
- Reasons for False Negatives:
  - Modules excluded from procedure
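
The reported F1 value follows from the precision and recall on this slide, since F1 is their harmonic mean. A short check, using only the two numbers given above:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.987, 0.945)
print(round(f1, 3))  # 0.966
```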

Evaluation: False Positive Analysis vs. Avast 10TB Goodware Data Set
- Avast generously supported our research by running rules against one of their clean data sets.
- Previous performance evaluation (snapshot July 2nd 2019):
  - Rules for 877/1320 families
  - F-Score: 0.977
- False Positive Analysis vs. Avast Goodware Data Set (10TB):
  - Total FPs: 129,267
  - 100 of 877 YARA rules trigger false positives.
    - 23 of 100 are below 10 FPs
    - 67 of 100 are below 100 FPs
    - 87 of 100 are below 1,000 FPs
    - 98 of 100 are below 10,000 FPs
  - YARA signature for "win.quantloader" triggers 51,819 (FP) hits on the data set
- Rules for which significant FPs are reported get removed from Malpedia

Evaluation: Interesting FP: win.tinynuke - win.unidentified_068
- Found reuse of previously "unique" code, yay!
- Meanwhile identified by Proofpoint [1] as "Buer" (loader)

    yara -C malpedia_auto.yac malpedia/win.unidentified_068 -r -s | sort
    win_tinynuke_auto /malpedia/win.unidentified_068/[redacted]_dump7_0x00400000
    0x7431:$sequence_2: 89 44 24 1C 2B 58 34 83 3F 00 74 5F 8D 47 04 89 44 24 14
    0x7446:$sequence_3: 83 F8 08 72 46 83 C0 F8 D1 E8 89 44 24 10 BA 00 00 00 00 74 36
    0x746b:$sequence_5: 83 F8 03 74 13 83 F8 0A 75 15 8B 07 03 06
    0x7495:$sequence_6: 03 38 83 3F 00 75 A1 8B 5E 08 89 5C 24 10 8B 4B 0C
    0x749f:$sequence_1: 89 5C 24 10 8B 4B 0C 85 C9 0F 84 A2 00 00 00 8B 06 03 C1 50
    0x74bf:$sequence_7: 8D 44 24 34 50 8D 44 24 28 50 8B 46 10 FF D0 8D 44 24 14
    0x74f2:$sequence_4: 03 1E 8B 0F 85 C9 74 40 79 05
    0x7548:$sequence_0: 85 C9 0F 85 5E FF FF FF 8B 44 24 1C 8B 0E

[1] buer-new-loader-emerges-underground-marketplace

Evaluation: Discussion / Lessons Learned
- Lots of family-unique Ngrams available!
  - This massively benefits rule generation (probably also code similarity analysis)
- Input data quality is essential:
  - Disassembly errors lead to False Positives
  - Insufficient example coverage leads to inferior rules:
    - 64bit
    - Static linking: Delphi, Go
- Biggest rule quality improvement:
  - In the Ngram selection process, exclude overlaps!

Future Work

Future Work: Potential Improvements
- Support more architectures (e.g. ARM, MIPS) or input formats
- "Daemonization"
  - Periodic (daily/weekly?) runs for Malpedia
- Work on coverage maximization for prevalent families
- Further evaluation
  - Minimize signatures (fewer sequences, ...)
  - Compare usage of raw bytes versus instruction Ngrams?

Thank You for Your Attention!
- Felix Bilstein: fxb@cocacoding.com, @fxb_b
- Daniel Plohmann: daniel.plohmann@fkie.fraunhofer.de, @push_pnx, @malpedia

Evaluation: BONUS: Instruction Statistics (Most Common Mnemonics)
- Difference: 32bit has extensive stack usage, not so 64bit (among other things because of the fastcall calling convention)
- Apart from order, mostly the same for 32bit & 64bit
- Table: most common mnemonics with counts (table contents not preserved)

Evaluation: BONUS: Instruction Statistics (Semantic Classes)
- 32bit: extensive stack usage
- 64bit: fastcall, memory
- Classes: M: Memory, C: CFG, S: Stack, A: Logic/Arithmetic, X: Extended (MMX, SSE, ...), F: Float, N: Nop, P: Privileged, Y: Crypto, V: VMX

Class   32bit count    32bit %      64bit count   64bit %
C       42,619,831     29.372910    3,952,593     31.104279
S       29,570,122     20.379258    425,962       3.352038
(remaining rows not preserved)

Evaluation: BONUS: Instruction Statistics (Instruction Length and Operand Count)
- Instruction Length Distribution
  - Fewer 1-byte instructions on 64bit, because 0x4? instructions are used as 64bit "marker"
- Operand Count Distribution
  - 3 and 4 operands mostly found in extended instruction sets (MMX, SSE, ...)
- Tables: Instruction Length Statistics (in bytes), Operand Count Statistics (table contents not preserved)

Evaluation: BONUS: Instruction Statistics (Instruction First Byte Heatmaps)
- Figures: Reference (32bit) [1], Heatmap (32bit), Heatmap (64bit) (images not preserved)
[1] https://net.cs.uni-bonn.de/fileadmin/user upload/plohmann/x86 opcode structure and instruction overview.pdf