Securing Your Big Data Environment

Ajit Gaddam
ajit@root777.com

Abstract

Security and privacy issues are magnified by the volume, variety, and velocity of Big Data. The diversity of data sources, formats, and data flows, combined with the streaming nature of data acquisition and high volume, creates unique security risks.

This paper details the security challenges that arise when organizations start moving sensitive data to a Big Data repository like Hadoop. It identifies the different threat models and the security control framework to address and mitigate security risks due to the identified threat conditions and usage models. The framework outlined in this paper is also meant to be distribution agnostic.

Keywords: Hadoop, Big Data, enterprise, defense, risk, Big Data Reference Framework, Security and Privacy, threat model

1 Introduction

The term "Big Data" refers to the massive amounts of digital information that companies collect. Industry estimates suggest that the volume of data roughly doubles every two years, from 2,500 Exabytes in 2012 to 40,000 Exabytes in 2020 [1]. Big Data is not a specific technology; it is a collection of attributes and capabilities.

NIST defines Big Data as the following [2]:

Big Data consists of extensive datasets, primarily in the characteristics of volume, velocity, and/or variety, that require a scalable architecture for efficient storage, manipulation, and analysis.

Securosis research [3] adds additional characteristics for a particular environment to qualify as 'Big Data':

1. It handles a petabyte of data or more
2. It has distributed redundant data storage
3. It can leverage parallel task processing
4. It can provide data processing (MapReduce or equivalent) capabilities
5. It has extremely fast data insertion
6. It has central management and orchestration
7. It is hardware agnostic
8. It is extensible, in that its basic capabilities can be augmented and altered

Security and privacy issues are magnified by the volume, variety, and velocity of Big Data. The diversity of data sources, formats, and data flows, combined with the streaming nature of data acquisition and high volume, creates unique security risks.

It is not merely the existence of large amounts of data that is creating new security challenges for organizations. Big Data has been collected and utilized by enterprises for several decades. Software infrastructures such as Hadoop enable developers and analysts to easily leverage hundreds of computing nodes to perform data-parallel computing in a way that was not possible before. As a result, new security challenges have arisen from the coupling of Big Data with heterogeneous compositions of commodity hardware, commodity operating systems, and commodity software infrastructures for storing and computing on data. As Big Data expands across enterprises, traditional security mechanisms tailored to securing small-scale, static data and data flows on firewalled and semi-isolated networks are inadequate. Similarly, it is unclear how to retrofit provenance into an enterprise's existing infrastructure. Throughout this document, unless explicitly called out, Big Data will refer to the Hadoop framework and its common NoSQL variants (e.g. Cassandra, MongoDB, Couch, Riak, etc.).

This paper details the security challenges that arise when organizations start moving sensitive data to a Big Data repository like Hadoop. It provides the different threat models and the security control framework to address and mitigate the risk due to the identified security threats.
In the following sections, the paper describes in detail the architecture of the modern Hadoop ecosystem and identifies the different security weaknesses of such systems. We then identify the threat conditions associated with them and their threat models. The paper concludes the analysis by providing a reference security framework for an enterprise Big Data environment.

2 Hadoop Security Weaknesses

Traditional Relational Database Management Systems (RDBMS) security has evolved over the years, with many 'eyeballs' assessing its security through various security evaluations. Unlike such solutions, Hadoop security has not undergone the same level of rigor or evaluation and thus can claim little assurance of the implemented security.

Another big challenge is that, today, there is no standardization or portability of security controls between the different Open-Source Software (OSS) projects and the different Hadoop or Big Data vendors. Hadoop security is completely fragmented. This is true even when the above parties implement the same security feature for the same Hadoop component. Vendors and OSS parties force-fit security into the Apache Hadoop framework.

2.1 Top 10 Security & Privacy Challenges

The Cloud Security Alliance Big Data Security Working Group has compiled the following as the Top 10 security and privacy challenges to overcome in Big Data [4]:

1. Secure computations in distributed programming frameworks
2. Security best practices for non-relational data stores
3. Secure data storage and transactions logs
4. End-point input validation/filtering
5. Real-time security monitoring
6. Scalable privacy-preserving data mining and analytics
7. Cryptographically enforced data centric security
8. Granular access control
9. Granular audits
10. Data provenance

Figure 1: CSA classification of the Top 10 Challenges

The above challenges were grouped into broad components by the Cloud Security Alliance, as shown in Figure 1. They were:

Infrastructure Security
- Secure computations in distributed programming frameworks
- Security best practices for non-relational data stores

Data Privacy
- Scalable privacy-preserving data mining and analytics

Identity & Access Management
- Cryptographically enforced data centric security
- Granular access control

Data Management
- Secure data storage and transactions logs
- Granular audits
- Data provenance

Integrity & Reactive Security
- End-point input validation/filtering
- Real-time security monitoring

2.2 Additional Security Weaknesses

The Cloud Security Alliance list above is an excellent start, and this research adds to it significantly. Where possible, effort has been made to map back to the categories identified in the CSA work. This section lists some additional security weaknesses associated with Open Source Software (OSS) like Apache Hadoop. It is meant to give the reader an idea of the possible attack surface; it is not meant to be exhaustive, and subsequent sections will add to it.

Infrastructure Security & Integrity
- The Common Vulnerabilities and Exposures (CVE) database shows only four reported and fixed Hadoop vulnerabilities over the past three years. Software, even Hadoop, is far from perfect. This could reflect either that the security community is not active or that most vulnerability remediation happens internally within the vendor environments themselves, with no public reporting.
- Hadoop security configuration files are not self-contained, and there are no validity checks before such policies are deployed. This usually results in data integrity and availability issues (a basic pre-deployment check is sketched after this list).
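As an illustration of the configuration weakness above, a minimal pre-deployment check might look like the following Python sketch. The file path and the permission policy are assumptions for the example; the checks simply verify that a policy file is well-formed XML and is not writable by unprivileged users before it is pushed to the cluster.

```python
import os
import stat
import sys
import xml.etree.ElementTree as ET

def check_policy_file(path):
    """Run basic sanity checks on a Hadoop policy/ACL file before deployment."""
    problems = []

    # 1. The file must parse as well-formed XML (e.g. hadoop-policy.xml style files).
    try:
        ET.parse(path)
    except ET.ParseError as err:
        problems.append(f"not well-formed XML: {err}")

    # 2. The file must not be writable by group or others, since clear-text
    #    policy files editable by non-root accounts undermine the access
    #    controls they define.
    mode = os.stat(path).st_mode
    if mode & (stat.S_IWGRP | stat.S_IWOTH):
        problems.append("file is group- or world-writable")

    return problems

if __name__ == "__main__":
    # Hypothetical path; adjust to your distribution's configuration directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "/etc/hadoop/conf/hadoop-policy.xml"
    issues = check_policy_file(target)
    if issues:
        print(f"REJECT {target}: " + "; ".join(issues))
        sys.exit(1)
    print(f"OK {target}")
```

Such a check does not make the files self-contained, but it catches malformed or overly permissive policies before they reach the cluster.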

Data Privacy & Security
- All issues associated with SQL injection types of attacks do not go away; they move with Hadoop components like Hive and Impala. SQL prepare functions, which would have enabled separation of the query and the data, are currently not available (a compensating-control sketch is shown after Figure 2 below).
- Lack of native cryptographic controls for sensitive data protection. Frequently, such security is provided outside the data or application stack.
- Clear-text data might be sent when communicating from DataNode to DataNode, since data locality cannot be strictly enforced and the scheduler might not be able to find resources next to the data, forcing it to read data over the network.
- Role Based Access Control (RBAC) policy files and Access Control Lists (ACLs) for components like MapReduce and HBase are usually configured via clear-text files. These files are editable by privileged accounts on the system, like root and other application accounts.

3 Big Data Security Framework

The following section provides the target security architecture framework for Big Data platform security. The core components of the proposed Big Data Security Framework are the following:

1. Data Management
2. Identity & Access Management
3. Data Protection & Privacy
4. Network Security
5. Infrastructure Security & Integrity

The above '5 pillars' of the Big Data Security Framework are further decomposed into 21 sub-components, each of which is critical to ensuring security and mitigating the security risks and threat vectors to the Big Data stack. The overall security framework is shown below.

Figure 2: Big Data Security Framework. The five pillars and their sub-components are:
- Data Management: Data Classification; Data Discovery; Data Tagging
- Identity & Access Management: AD, LDAP, Kerberos; Data Metering; User Entitlement; RBAC Authorization (server, DB, table, and view based)
- Data Protection & Privacy: Data Masking / Data Redaction; Tokenization; Field-Level / Column-Level Encryption; Disk-Level Transparent Encryption; HDFS File/Folder Encryption; Data Loss Prevention
- Network Security: Packet-Level Encryption in cluster (NameNode-JobTracker-DataNode and NameNode-to-other management nodes; SSL/TLS); Packet-Level Encryption client-to-cluster (SSL/TLS); Packet-Level Encryption in cluster (mapper-reducer; SSL/TLS); Network Security Zoning
- Infrastructure Security & Integrity: Logging / Audit; Security Enhanced Linux; File Integrity / Data Tamper Monitoring; Privileged User & Activity Monitoring
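Because Hive and Impala lack server-side prepared statements, the engine itself cannot separate query text from user-supplied data, as noted in Section 2.2. One illustrative compensating control is strict client-side allow-listing before any HiveQL string is assembled; in the sketch below, the table name, the allowed column set, and the value pattern are assumptions for the example.

```python
import re

# Hypothetical allow-list of columns a reporting job may filter on.
ALLOWED_COLUMNS = {"customer_id", "zip_code", "account_status"}

# Only permit short alphanumeric values (plus '-' and '_') as literals.
SAFE_VALUE = re.compile(r"^[A-Za-z0-9_\-]{1,64}$")

def build_filter_query(table, column, value):
    """Build a HiveQL query defensively when prepared statements are unavailable."""
    if not re.match(r"^[A-Za-z0-9_]+$", table):
        raise ValueError(f"illegal table name: {table!r}")
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"column not in allow-list: {column!r}")
    if not SAFE_VALUE.match(value):
        raise ValueError(f"value failed validation: {value!r}")
    # Safe to interpolate only because every part has been validated above.
    return f"SELECT * FROM {table} WHERE {column} = '{value}'"

if __name__ == "__main__":
    print(build_filter_query("customers", "zip_code", "94105"))
    # build_filter_query("customers", "zip_code", "' OR '1'='1")  # raises ValueError
```

Allow-listing is narrower than escaping, but it is easier to reason about: anything outside the expected shape is rejected rather than neutralized.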

3.1 Data Management

The Data Management component is decomposed into three core sub-components: Data Classification, Data Discovery, and Data Tagging.

3.1.1 Data Classification

Effective data classification is probably one of the most important activities that can in turn lead to effective security control implementation in a Big Data platform. When organizations deal with an extremely large amount of data (i.e. Big Data), clearly identifying which data matters, what needs cryptographic protection, and which fields need to be prioritized first for protection more often than not determines the success of a security initiative on this platform.

The following are the core items that have been developed over time and can lead to a successful data classification matrix for your environment:

1. Work with your legal, privacy office, Intellectual Property, Finance, and Information Security teams to determine all distinct data fields. An open bucket like "health data" is not sufficient. This exercise encourages the reader to go beyond the symbolic, policy-level exercise.
2. Perform a security control assessment exercise.
   a. Determine the location of data (e.g. exposed to the internet, secure data zone)
   b. Determine the number of users and systems with access
   c. Determine security controls (e.g. can it be protected cryptographically)
3. Determine the value of the data to an attacker.
   a. Is the data easy to resell on the black market?
   b. Do you have valuable Intellectual Property (e.g. a nation state looking for nuclear reactor blueprints)?
4. Determine compliance and revenue impact.
   a. Determine breach reporting requirements for all the distinct fields
   b. Does loss of a particular data field prevent you from doing business (e.g. cardholder data)?
   c. Estimate the re-architecting cost for current systems (e.g. buying new security products)
   d. Other costs, like more frequent auditing, fines and judgements, and legal expenses related to compliance
5. Determine the impact to the owner of the PII data (e.g. a customer).
   a. Does the field enable phishing attacks (e.g. an email address), or can it simply be replaced (e.g. loss of a credit card)?

The following figure is a sample representation of certain Personally Identifiable Data fields.

Figure 3: Data Classification Matrix

3.1.2 Data Discovery

The lack of situational awareness with respect to sensitive data could leave an organization exposed to significant risks. Identifying whether sensitive data is present in Hadoop and where it is located, and subsequently triggering the appropriate data protection measures, such as data masking, data redaction, tokenization, or encryption, is key.

- For structured data going into Hadoop, such as relational data from databases or, for example, comma-separated values (CSV) or JavaScript Object Notation (JSON)-formatted files, the location and classification of sensitive data may already be known. In this case, the protection of those columns or fields can occur programmatically, with, for example, a labeling engine that assigns visibility labels / cell-level security to those fields.
- With unstructured data, the location, count, and classification of sensitive data become much more difficult. Data discovery, where sensitive data can be identified and located, becomes an important first step in data protection.

The following items are crucial for an effective data discovery exercise of your Big Data environment (a metrics sketch follows this list):

1. Define and validate the data structure and schema. This is all useful preparatory work for data protection activities later.
2. Collect metrics (e.g. volume counts, unique counts, etc.). For example, if a file has 1M records but they are duplicates of a single person, it is a single record rather than 1M records. This is very useful for compliance but, more importantly, for risk management.
3. Share this insight with your Data Science teams so they can build threat models and profiles, which will be useful in data exfiltration prevention scenarios.
4. If you discover sequence files, work with your application teams to move away from this data structure. Instead, leverage columnar storage formats such as Apache Parquet where possible, regardless of the data processing framework, data model, or programming language.
5. Build conditional search routines (e.g. only report on a date of birth if a person's name is found, or a credit card number together with a CVV or ZIP code).
6. Account for use cases where sensitive data has already been cryptographically protected (e.g. encrypted or tokenized), and determine what the role of the discovery solution is then.
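To illustrate the metrics step above, a minimal sketch that compares raw record volume against unique identities for a delimited extract might look like the following; the file name, the delimiter, and the choice of "customer_id" as the identifying field are assumptions.

```python
import csv

def discovery_metrics(path, id_field="customer_id", delimiter=","):
    """Collect basic discovery metrics: total records vs. unique identities."""
    total = 0
    unique_ids = set()
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter=delimiter):
            total += 1
            unique_ids.add(row.get(id_field, "").strip().lower())
    unique_ids.discard("")  # ignore rows with a missing identifier
    return {"total_records": total, "unique_identities": len(unique_ids)}

if __name__ == "__main__":
    # Hypothetical extract; 1M rows that all belong to one person would
    # report total_records=1000000 but unique_identities=1.
    print(discovery_metrics("customer_extract.csv"))
```

On a real cluster the same counts would typically be produced by a distributed job (e.g. Hive or Spark aggregations) rather than a single-node script, but the risk question is the same: how many distinct people or accounts are actually exposed.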

3.1.3 Data Tagging

Understand the end-to-end data flows in your Big Data environment, especially the ingress and egress methods.

1. Identify all the data ingress methods into your Big Data cluster. These would include all manual methods (e.g. Hadoop admins), automated methods (e.g. ETL jobs), and those that go through some meta-layer (e.g. copy files or create/write).
2. Knowing whether data is coming in via the command line interface, through some Java API, through a Flume or Sqoop import, or whether it is being SSH'd in is important.
3. Similarly, follow the data out and identify all the egress components out of your Big Data environment.
4. This includes whether reporting jobs are being run through Hive queries (e.g. through ODBC/JDBC), through Pig jobs (e.g. reading files, Hive tables, or HCatalog), or whether data is exported via Sqoop or copied via REST API, Hue, etc. This will determine your control boundaries and trust zones.
5. All of the above will also help in the data discovery activity and other data access management exercises (e.g. to implement RBAC, ABAC, etc.).

3.2 Identity & Access Management

POSIX-style permissions in secure HDFS are the basis for many access controls across the Hadoop stack.

3.2.1 RBAC Authorization

Deliver fine-grained authorization through Role Based Access Control (RBAC).

- Manage data access by role (and not by user).
- Determine relationships between users and roles through groups. Leverage AD/LDAP group membership and enforce rules across all data access paths.

3.2.2 User Entitlement / Data Metering

Provide users access to data by centrally managing access policies.

- It is important to tie policy to the data and not to the access method.
- Leverage Attribute Based Access Control (ABAC) and protect data based on tags that move with the data through lineage; permissions decisions can leverage the user, the environment (e.g. location), and data attributes.
- Perform data metering by restricting access to data once a normal threshold (as determined by access models and machine learning algorithms) is passed for a particular user or application.

3.3 Data Protection & Privacy

The majority of the Hadoop distributions and vendor add-ons package data-at-rest encryption at either a block or (whole) file level. Application-level cryptographic protection (such as tokenization and field-level encryption) and data redaction/masking provide the next level of security needed.

3.3.1 Application Level Cryptography (Tokenization, Field-Level Encryption)

While encryption at the field/element level can offer security granularity and audit tracking capabilities, it comes at the expense of requiring manual intervention to determine the fields that require encryption and where and how to enable authorized decryption.
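For illustration, a field-level protection step might look like the sketch below, which encrypts a single sensitive field before a record is written to the cluster using the symmetric Fernet construction from the widely used cryptography package. The field name and the in-process key are assumptions for the example; in practice the key would live in an external key management service, not in the application.

```python
from cryptography.fernet import Fernet

# In production the key comes from a key management service (KMS/HSM),
# never from source code; generated here only to keep the sketch runnable.
KEY = Fernet.generate_key()
cipher = Fernet(KEY)

def protect_record(record, field="ssn"):
    """Encrypt one sensitive field, leaving the rest of the record usable."""
    protected = dict(record)
    protected[field] = cipher.encrypt(record[field].encode()).decode()
    return protected

def reveal_field(record, field="ssn"):
    """Authorized decryption of a previously protected field."""
    return cipher.decrypt(record[field].encode()).decode()

if __name__ == "__main__":
    row = {"customer_id": "42", "ssn": "123-45-6789"}
    stored = protect_record(row)   # what would land in Hadoop
    print(stored["ssn"])           # ciphertext, safe to store broadly
    print(reveal_field(stored))    # plaintext, for authorized consumers only
```

Tokenization uses the same insertion point but replaces the value with a token from a vault-backed mapping instead of ciphertext, which can preserve format and referential integrity at the cost of operating a token service.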

3.3.2 Transparent Encryption (Disk / HDFS Layer)

Full Disk Encryption (FDE) prevents access via the storage medium. File encryption can also guard against (privileged) access at the node's operating-system level.

- If you need to store and process sensitive or regulated data in Hadoop, data-at-rest encryption protects your organization's sensitive data and keeps at least the disks out of audit scope.
- In larger Hadoop clusters, disks often need to be removed from the cluster and replaced. Disk-level transparent encryption ensures that no human-readable residual data remains when data is removed or when disks are decommissioned.
- Full-disk encryption can also be OS-native disk encryption, such as dm-crypt.

3.3.3 Data Masking / Data Redaction

Data masking or data redaction in the typical ETL process de-identifies personally identifiable information (PII) before load. Therefore, no sensitive data is stored in Hadoop, keeping the Hadoop cluster potentially out of (audit) scope.

- This may be performed in batch or in real time and can be achieved with a variety of designs, including the use of static and dynamic data masking tools, as well as through data services.

3.4 Network Security

The Network Security layer is decomposed into four sub-components: the data protection in-transit (packet-level encryption) components and network security zoning.

3.4.1 Data Protection In-Transit

Secure communications are required for HDFS to protect data in transit. There are multiple threat scenarios that mandate the use of HTTPS to prevent the information disclosure and elevation of privilege threat categories:

- An attacker can gain unauthorized access to data by intercepting communications to Hadoop consoles.
- This could include communication between NameNodes and DataNodes that travels in the clear back to the Hadoop clients, which in turn can result in credentials or data being sniffed.
- Tokens that are granted to the user after Kerberos authentication can also be sniffed and can be used to impersonate users on the NameNode.

Use the TLS protocol (which is now available in all Hadoop distributions) to authenticate and ensure the privacy of communications between nodes, name servers, and applications:

1. Packet-level encryption using TLS from the client to the Hadoop cluster.
2. Packet-level encryption using TLS within the cluster itself. This includes using HTTPS between the NameNode, the JobTracker, and the DataNodes.
3. Packet-level encryption using TLS in the cluster (e.g. mapper-reducer).
4. Use LDAP over SSL (LDAPS) rather than LDAP when communicating with the corporate enterprise directories, to prevent sniffing attacks.
5. Allow your admins to configure and enable encrypted shuffle and TLS/HTTPS for the HDFS, MapReduce, YARN, and HBase UIs, etc.

3.4.2 Network Security Zoning

The Hadoop clusters must be segmented into points of delivery (PODs), with chokepoints such as Top of Rack (ToR) switches where network Access Control Lists (ACLs) limit the allowed traffic to approved levels.

- End users must not be able to connect to the individual DataNodes, but to the NameNodes only.
- The Apache Knox gateway, for example, provides the capability to control traffic into and out of Hadoop at per-service-level granularity.
- A basic firewall should allow access only to the Hadoop NameNode or, where sufficient, to an Apache Knox gateway. Clients will never need to communicate directly with, for example, a DataNode.

3.5 Infrastructure Security & Integrity

The Infrastructure Security & Integrity layer is decomposed into four core sub-components. They are Logging/Audit, Security Enhanced Linux (SELinux), File Integrity / Data Tamper Monitoring, and Privileged User & Activity Monitoring.
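As an illustration of the File Integrity / Data Tamper Monitoring sub-component, a minimal sketch might record SHA-256 digests of configuration and policy files and report drift on a later run. The watched directories and the baseline location are assumptions; an enterprise deployment would typically use a dedicated agent feeding a central SIEM instead of a standalone script.

```python
import hashlib
import json
import os

BASELINE = "integrity_baseline.json"               # hypothetical baseline store
WATCHED = ["/etc/hadoop/conf", "/etc/hbase/conf"]  # hypothetical watched dirs

def digest(path):
    """Return the SHA-256 hex digest of a file."""
    h = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(dirs):
    """Walk the watched directories and hash every regular file."""
    state = {}
    for root_dir in dirs:
        for root, _subdirs, files in os.walk(root_dir):
            for name in files:
                path = os.path.join(root, name)
                state[path] = digest(path)
    return state

if __name__ == "__main__":
    current = snapshot(WATCHED)
    if os.path.exists(BASELINE):
        with open(BASELINE) as handle:
            baseline = json.load(handle)
        changed = [p for p, d in current.items() if baseline.get(p) != d]
        removed = [p for p in baseline if p not in current]
        for path in changed + removed:
            print(f"ALERT: integrity change detected for {path}")
    # Refresh the baseline after reporting.
    with open(BASELINE, "w") as handle:
        json.dump(current, handle, indent=2)
```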
