Top Ten Big Data Security and Privacy Challenges
November 2012
CLOUD SECURITY ALLIANCE Top Ten Big Data Security and Privacy Challenges

© 2012 Cloud Security Alliance - All Rights Reserved

All rights reserved. You may download, store, display on your computer, view, print, and link to the Cloud Security Alliance Big Data Top Ten at http://www.cloudsecurityalliance.org, subject to the following: (a) the Document may be used solely for your personal, informational, non-commercial use; (b) the Document may not be modified or altered in any way; (c) the Document may not be redistributed; and (d) the trademark, copyright or other notices may not be removed. You may quote portions of the Document as permitted by the Fair Use provisions of the United States Copyright Act, provided that you attribute the portions to the Cloud Security Alliance Big Data Top Ten (2012).

Contents

Acknowledgments
1.0 Abstract
2.0 Introduction
3.0 Secure Computations in Distributed Programming Frameworks
    3.1 Use Cases
4.0 Security Best Practices for Non-Relational Data Stores
    4.1 Use Cases
5.0 Secure Data Storage and Transactions Logs
    5.1 Use Cases
6.0 End-Point Input Validation/Filtering
    6.1 Use Cases
7.0 Real-time Security/Compliance Monitoring
    7.1 Use Cases
8.0 Scalable and Composable Privacy-Preserving Data Mining and Analytics
    8.1 Use Cases
9.0 Cryptographically Enforced Access Control and Secure Communication
    9.1 Use Cases
10.0 Granular Access Control
    10.1 Use Cases
11.0 Granular Audits
    11.1 Use Cases
12.0 Data Provenance
    12.1 Use Cases
13.0 Conclusion

Acknowledgments

CSA Big Data Working Group Co-Chairs
Lead: Sreeranga Rajan, Fujitsu
Co-Chair: Wilco van Ginkel, Verizon
Co-Chair: Neel Sundaresan, eBay

Contributors
Alvaro Cardenas Mora, Fujitsu
Yu Chen, SUNY Binghamton
Adam Fuchs, Sqrrl
Adrian Lane, Securosis
Rongxing Lu, University of Waterloo
Pratyusa Manadhata, HP Labs
Jesus Molina, Fujitsu
Praveen Murthy, Fujitsu
Arnab Roy, Fujitsu
Shiju Sathyadevan, Amrita University

CSA Global Staff
Aaron Alva, Graduate Research Intern
Luciano JR Santos, Research Director
Evan Scoboria, Webmaster
Kendall Scoboria, Graphic Designer
John Yeoh, Research Analyst

1.0 Abstract

Security and privacy issues are magnified by the velocity, volume, and variety of big data: large-scale cloud infrastructures, diversity of data sources and formats, the streaming nature of data acquisition, and high-volume inter-cloud migration. Traditional security mechanisms, which are tailored to securing small-scale static (as opposed to streaming) data, are therefore inadequate. In this paper, we highlight the top ten big data-specific security and privacy challenges. We expect that highlighting these challenges will bring renewed focus on fortifying big data infrastructures.

2.0 Introduction

The term big data refers to the massive amounts of digital information companies and governments collect about us and our surroundings. Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. Security and privacy issues are magnified by the velocity, volume, and variety of big data: large-scale cloud infrastructures, diversity of data sources and formats, the streaming nature of data acquisition, and high-volume inter-cloud migration. The use of large-scale cloud infrastructures, with a diversity of software platforms spread across large networks of computers, also increases the attack surface of the entire system.

Traditional security mechanisms, which are tailored to securing small-scale static (as opposed to streaming) data, are inadequate. For example, analytics for anomaly detection would generate too many outliers. Similarly, it is not clear how to retrofit provenance in existing cloud infrastructures. Streaming data demands ultra-fast response times from security and privacy solutions.

In this paper, we highlight the top ten big data-specific security and privacy challenges.
We interviewed Cloud Security Alliance members and surveyed security practitioner-oriented trade journals to draft an initial list of high-priority security and privacy problems, studied published research, and arrived at the following top ten challenges:

1. Secure computations in distributed programming frameworks
2. Security best practices for non-relational data stores
3. Secure data storage and transactions logs
4. End-point input validation/filtering
5. Real-time security/compliance monitoring
6. Scalable and composable privacy-preserving data mining and analytics
7. Cryptographically enforced access control and secure communication
8. Granular access control
9. Granular audits
10. Data provenance

In the rest of the paper, we provide brief descriptions and narrate use cases.

3.0 Secure Computations in Distributed Programming Frameworks

Distributed programming frameworks utilize parallelism in computation and storage to process massive amounts of data. A popular example is the MapReduce framework, which splits an input file into multiple chunks. In the first phase of MapReduce, a Mapper for each chunk reads the data, performs some computation, and outputs a list of key/value pairs. In the next phase, a Reducer combines the values belonging to each distinct key and outputs the result. There are two major attack prevention measures: securing the mappers and securing the data in the presence of an untrusted mapper.

3.1 Use Cases

Untrusted mappers could return wrong results, which will in turn generate incorrect aggregate results. With large data sets, such errors are next to impossible to identify, and they can cause significant damage, especially in scientific and financial computations.

Retailer consumer data is often analyzed by marketing agencies for targeted advertising or customer segmentation. These tasks involve highly parallel computations over large data sets and are particularly suited to MapReduce frameworks such as Hadoop. However, the data mappers may contain intentional or unintentional leakages. For example, a mapper may emit a unique value derived from a private record, undermining users’ privacy.

4.0 Security Best Practices for Non-Relational Data Stores

Non-relational data stores popularized by NoSQL databases are still evolving with respect to security infrastructure. For instance, robust solutions to NoSQL injection are still not mature. Each NoSQL database was built to tackle different challenges posed by the analytics world, and security was never part of the model at any stage of its design. Developers using NoSQL databases usually embed security in the middleware. NoSQL databases do not provide any support for enforcing security explicitly in the database.
However, the clustering aspect of NoSQL databases poses additional challenges to the robustness of such security practices.

4.1 Use Cases

Companies dealing with big unstructured data sets may benefit from migrating from a traditional relational database to a NoSQL database, which can accommodate and process huge volumes of data. In general, the security philosophy of NoSQL databases relies on external enforcement mechanisms. To reduce security incidents, the company must review the security policies of the middleware, adding items to its engine, and at the same time harden the NoSQL database itself to match its relational counterparts without compromising its operational features.
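
As an illustration of security embedded in the middleware, the following minimal sketch shows one way a middleware layer might reject NoSQL injection before a query reaches the data store. It assumes a MongoDB-style document store in which keys beginning with `$` are interpreted as query operators; the names `UnsafeQueryError`, `reject_operators`, and `build_login_filter` are hypothetical and do not come from this paper or from any particular database driver.

```python
from typing import Any


class UnsafeQueryError(ValueError):
    """Raised when user-supplied input carries operator-like keys."""


def reject_operators(value: Any) -> Any:
    """Recursively reject dict keys that look like query operators or paths.

    Assumption: the store treats keys starting with '$' as operators, so a
    form field that arrives as {"$ne": ""} instead of a string could turn an
    equality test into "match anything".
    """
    if isinstance(value, dict):
        for key, inner in value.items():
            if not isinstance(key, str) or key.startswith("$") or "." in key:
                raise UnsafeQueryError(f"key not allowed in user input: {key!r}")
            reject_operators(inner)
    elif isinstance(value, list):
        for item in value:
            reject_operators(item)
    return value


def build_login_filter(username: Any, password_hash: Any) -> dict:
    """Hypothetical middleware helper: only plain strings reach the filter."""
    for field in (username, password_hash):
        if not isinstance(field, str):
            raise UnsafeQueryError("expected a plain string")
    return {"user": username, "password_hash": password_hash}
```

A call such as `build_login_filter({"$ne": ""}, "abc123")` raises `UnsafeQueryError` instead of producing a filter that matches every user. A check of this kind covers only one injection pattern and does not replace hardening of the database itself.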

5.0 Secure Data Storage and Transactions Logs

Data and transaction logs are stored in multi-tiered storage media. Manually moving data between tiers gives the IT manager direct control over exactly what data is moved and when. However, as data sets have grown, and continue to grow, exponentially, scalability and availability have necessitated auto-tiering for big data storage management. Auto-tiering solutions do not keep track of where the data is stored, which poses new challenges to secure data storage. New mechanisms are imperative to thwart unauthorized access and maintain 24/7 availability.

5.1 Use Cases

A manufacturer wants to integrate data from different divisions. Some of this data is rarely retrieved, while some divisions constantly utilize the same data pools. An auto-tier storage system will save the manufacturer money by moving the rarely utilized data to a lower (and cheaper) tier. However, this data may consist of R&D results, which are rarely accessed but contain critical information. Because lower tiers often provide reduced security, the company should study its tiering strategies carefully.

6.0 End-Point Input Validation/Filtering

Many big data use cases in enterprise settings require data collection from many sources, such as end-point devices. For example, a security information and event management (SIEM) system may collect event logs from millions of hardware devices and software applications in an enterprise network. A key challenge in the data collection process is input validation: how can we trust the data? How can we validate that a source of input data is not malicious, and how can we filter malicious input from our collection?
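
One partial answer to the source-validation question, offered here only as a sketch, is to have each enrolled end point authenticate its messages with a keyed hash and to apply a plausibility filter to the values. The device registry `DEVICE_KEYS`, the field name `temp_c`, the range bounds, and both function names are assumptions made for illustration. This approach does not help when an enrolled device is itself compromised.

```python
import hashlib
import hmac
import json

# Hypothetical registry of per-device secrets provisioned at enrollment.
DEVICE_KEYS = {"sensor-17": b"demo-key-not-for-production"}


def sign_reading(device_id, reading, key):
    """Compute an HMAC-SHA256 tag over a canonical JSON encoding."""
    payload = json.dumps(
        {"device": device_id, "reading": reading}, sort_keys=True
    ).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def accept_reading(device_id, reading, tag, low=-90.0, high=60.0):
    """Accept a reading only if the source is known, the tag verifies,
    and the value falls inside an assumed plausible range."""
    key = DEVICE_KEYS.get(device_id)
    if key is None or not isinstance(tag, str):
        return False  # unknown source or malformed tag
    expected = sign_reading(device_id, reading, key)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode("utf-8"), tag.encode("utf-8")):
        return False
    temp = reading.get("temp_c")
    if isinstance(temp, bool) or not isinstance(temp, (int, float)):
        return False
    return low <= temp <= high
```

A reading signed with the enrolled key passes, while a reading whose content was altered after signing, or one that comes from an unregistered device ID, is rejected.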
Input validation and filtering is a daunting challenge posed by untrusted input sources, especially with the bring your own device (BYOD) model.

6.1 Use Cases

Data retrieved from weather sensors and feedback votes sent by an iPhone application share a similar validation problem. A motivated adversary may be able to create “rogue” virtual sensors, or spoof iPhone IDs, to rig the results. This is further complicated by the amount of data collected, which may exceed millions of readings or votes. To perform these tasks effectively, algorithms need to be created to validate the input for large data sets.

7.0 Real-time Security/Compliance Monitoring

Real-time security monitoring has always been a challenge, given the number of alerts generated by (security) devices. These alerts (correlated or not) lead to many false positives, which are mostly ignored or simply “clicked away,” as humans cannot cope with the sheer volume. This problem might even increase with big data,
given the volume and velocity of data streams. However, big data technologies might also provide an opportunity, in the sense that these technologies allow for fast processing and analytics of different types of data, which in turn can be used to provide, for instance, real-time anomaly detection based on scalable security analytics.

7.1 Use Cases

Most industries and government agencies will benefit from real-time security analytics, although the use cases may differ. Some use cases are common, such as “Who is accessing which data from which resource at what time?”; “Are we under attack?”; or “Do we have a breach of compliance standard C because of action A?” These are not really new, but the difference is that we have more data at our disposal to make faster and better decisions (e.g., fewer false positives) in that regard. However, new use cases can be defined, or existing use cases redefined, in light of big data. For example, the health industry benefits greatly from big data technologies, potentially saving taxpayers billions by paying claims more accurately and reducing claims-related fraud. However, at the same time, the records stored may be extremely sensitive and have to be compliant with HIPAA or regional/local regulations, which call for careful protection of that same data.
Detecting in real time the anomalous retrieval of personal information, intentional or unintentional, allows the health care provider to repair the damage promptly and to prevent further misuse.

8.0 Scalable and Composable Privacy-Preserving Data Mining and Analytics

Big data can be seen as a troubling manifestation of Big Brother by potentially enabling invasions of privacy, invasive marketing, decreased civil freedoms, and increased state and corporate control.

A recent analysis of how companies are leveraging data analytics for marketing purposes identified an example of how a retailer was able to identify that a teenager was pregnant before her father knew. Similarly, anonymizing data for analytics is not enough to maintain user privacy. For example, AOL released anonymized search logs for academic purposes, but users were easily identified by their searches. Netflix faced a similar problem when users of their anonymized data set were identified by correlating their Netflix movie scores with IMDB scores.

Therefore, it is important to establish guidelines and recommendations for preventing inadvertent privacy disclosures.

8.1 Use Cases

User data collected by companies and government agencies are constantly mined and analyzed by inside analysts and also potentially outside contractors or business partners. A malicious insider or untrusted partner can abuse these datasets and extract private information from customers.
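
The AOL and Netflix examples in Section 8.0 suggest that removing names does not stop linkage through the attributes that remain. One way to illustrate this, using a technique that is not taken from this paper, is a k-anonymity style check: count how many records share each combination of quasi-identifiers and look at the smallest group. The records, the field names, and the function `smallest_group` below are made up for illustration.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Return the size of the smallest set of records that share the same
    values for the given quasi-identifier fields (0 for an empty data set).

    A result of 1 means at least one person is unique on those fields and
    could be singled out by anyone holding a second data set with the same
    attributes.
    """
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


# Made-up records with names already removed.
RECORDS = [
    {"zip": "13902", "birth_year": 1980, "sex": "F", "diagnosis": "flu"},
    {"zip": "13902", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "13902", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
]

k_full = smallest_group(RECORDS, ["zip", "birth_year", "sex"])
k_zip_only = smallest_group(RECORDS, ["zip"])
```

Here `k_full` is 1, because one record is unique on ZIP code, birth year, and sex, while `k_zip_only` is 3. A check like this only flags one kind of risk; passing it does not by itself establish that a release is privacy-preserving.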

Similarly, intelligence agencies require the collection of vast amounts of data. The data sources are numerous and may include chat rooms, personal blogs, and network routers. Most collected data is, however, innocent in nature; it need not be retained, and its anonymity should be preserved.

Robust and scalable privacy-preserving mining algorithms will increase the chances of collecting relevant information to increase user safety.

9.0 Cryptographically Enforced Access Control and Secure Communication

To ensure that the most sensitive private data is end-to-end secure and only accessible to the authorized entities, data has to be encrypted based on access control policies. Specific research in this area, such as attribute-based encryption (ABE), has to be made richer, more efficient, and scalable. To ensure authentication, agreement, and fairness among the distributed entities, a cryptographically secure communication framework has to be implemented.

9.1 Use Cases

Sensitive data is routinely stored unencrypted in the cloud. The main problem with encrypting data, especially large data sets, is the all-or-nothing retrieval policy of encrypted data, which prevents users from easily performing fine-grained actions such as sharing records or running searches. ABE alleviates this problem by utilizing a public key cryptosystem where attributes related to the encrypted data serve to unlock the keys. On the other hand, we have unencrypted, less sensitive data as well, such as data useful for analytics. Such data has to be communicated in a secure and agreed-upon way using a cryptographically secure communication framework.

10.0 Granular Access Control

The security property that matters from the perspective of access control is secrecy: preventing access to data by people who should not have access. The problem with coarse-grained access mechanisms is that data that could otherwise be shared is often swept into a more restrictive category to guarantee sound security.
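
To make the contrast with coarse-grained mechanisms concrete, the following minimal sketch attaches a set of required labels to each individual cell, so that a reader sees every cell whose labels they hold rather than losing access to a whole record. The `Cell` type, the label names, and `visible_cells` are hypothetical and are not drawn from any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    required: frozenset  # labels a reader must hold (all of them)


def visible_cells(cells, authorizations):
    """Return only the cells whose required labels are all held by the reader."""
    held = frozenset(authorizations)
    return [cell for cell in cells if cell.required <= held]


# Made-up data: one patient row with per-cell labels.
CELLS = [
    Cell("patient-1", "zip", "13902", frozenset()),
    Cell("patient-1", "diagnosis", "asthma", frozenset({"medical"})),
    Cell("patient-1", "billing", "overdue", frozenset({"finance"})),
]

clinician_view = [c.column for c in visible_cells(CELLS, {"medical"})]
```

With only the `medical` label, `clinician_view` contains `zip` and `diagnosis` but not `billing`. Under a record-level policy, the presence of the billing cell would have forced the whole row into the more restrictive category.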
Granular access control gives data managers a scalpel instead of a sword to share data as much as possible without compromising secrecy.

10.1 Use Cases

Big data analysis and cloud computing are increasingly focused on handling diverse data sets, both in terms of variety of schemas and variety of security requirements. Legal and policy restrictions on data come from numerous sources. The Sarbanes-Oxley Act levies requirements to protect corporate financial information, and the Health Insurance Portability and Accountability Act includes numerous restrictions on sharing personal health records. Executive Order 13526 outlines an elaborate system of protecting national security information.
Privacy policies, sharing agreements, and corporate policy also impose requirements on data handling. Managing this plethora of restrictions has so far resulted in increased costs for developing applications and a walled garden approach in which few people can participate in the analysis. Granular access control is necessary for analytical systems to adapt to this increasingly complex security environment.

11.0 Granular Audits

With real-time security m