Bootstrapping Privacy Compliance In Big Data Systems


Shayak Sen, Saikat Guha†, Anupam Datta, Sriram K. Rajamani†, Janice Tsai‡ and Jeannette M. Wing‡
Carnegie Mellon University, Pittsburgh, USA: {shayaks,danupam}@cmu.edu
† Microsoft Research, Bangalore, India: {saikat,sriram}@microsoft.com
‡ Microsoft Research, Redmond, USA: {jatsai,wing}@microsoft.com

Abstract—With the rapid increase in cloud services collecting and using user data to offer personalized experiences, ensuring that these services comply with their privacy policies has become a business imperative for building user trust. However, most compliance efforts in industry today rely on manual review processes and audits designed to safeguard user data, and therefore are resource intensive and lack coverage. In this paper, we present our experience building and operating a system to automate privacy policy compliance checking in Bing. Central to the design of the system are (a) LEGALEASE—a language that allows specification of privacy policies that impose restrictions on how user data is handled; and (b) GROK—a data inventory for Map-Reduce-like big data systems that tracks how user data flows among programs. GROK maps code-level schema elements to datatypes in LEGALEASE, in essence, annotating existing programs with information flow types with minimal human input. Compliance checking is thus reduced to information flow analysis of big data systems. The system, bootstrapped by a small team, checks compliance daily of millions of lines of ever-changing source code written by several thousand developers.

I. INTRODUCTION

Web services companies, such as Facebook, Google, and Microsoft, that use personal information of users for various functions are expected to comply with their declared privacy policies. Companies in the US are legally required to disclose their data collection and use practices, and the US Federal Trade Commission (FTC) has the mandate to enforce compliance, which it exercises by imposing penalties on companies found to violate their own stated policies [1], [2], [3]. In practice, these legal requirements translate to companies creating review processes and conducting internal audits to ensure compliance [4], [5]. Manual reviews and audits are time-consuming, resource-intensive, lack coverage, and, thus, inherently do not scale well in large companies; indeed, there have been cases where internal processes have not caught policy violations [6]. In this paper we take the first steps toward automated checking of large-scale Map-Reduce-like big data systems for compliance with privacy policies that restrict how various types of personal information flow through these systems. Our deployed prototype reduces compliance checking time and improves coverage by orders of magnitude, across the data analytics pipeline of Bing. Further, the human resources needs are small: the prototype is run by a small team, and scales to the needs of several thousand developers working on Bing.

Fig. 1. Privacy Compliance Workflow. Manual reviews and audits are highly time-consuming, and resource-intensive. By encoding policy in LEGALEASE using the GROK data inventory, we decouple interactions so policy specification, interpretation, product development, and continuous auditing can proceed in parallel.

To contextualize the challenges in performing automated privacy compliance checking in a large company with tens of thousands of employees, it is useful to understand the division of labor and responsibilities in current compliance workflows [4], [5].
Privacy policies are typically crafted by lawyers in a corporate legal team to adhere to all applicable laws and regulations worldwide. Due to the rapid change in product features and internal processes, these policies are necessarily specified using high-level policy concepts that may not cleanly map to the products that are expected to comply with them. For instance, a policy may refer to "IP Address", which is a high-level policy concept, and the product may have thousands of data stores where data derived from the "IP Address" is stored (and called with different names) and several thousand processes that produce and consume this data, all of which have to comply with policy. The task of interpreting the policy as applicable to individual products then falls to the tens of privacy champions embedded in product groups. Privacy champions review product features at various stages of the development process, offering specific requirements to the development teams to ensure compliance with policy.

The code produced by the development team is expected to adhere to these requirements. Periodically, the compliance team audits development teams to ensure that the requirements are met.

We illustrate this process with a running example we use throughout this paper. Let us assume that we are interested in checking compliance for an illustrative policy clause that promises "full IP address will not be used for advertising." The privacy champion reviewing the algorithm design for, say, online advertisement auctions, may learn in a meeting with the development team that they use the IP address to infer the user's location, which is used as a bid-modifier in the auction. The privacy champion may point out that this program is not compliant with the above policy and suggest to the development team to truncate the IP address by dropping the last octet to comply with the policy, without significantly degrading the accuracy of the location inference. The development team then modifies the code to truncate the IP address. Periodically, the audit team may ask the development team whether the truncation code is still in place. Later, the advertising abuse detection team may need to use the IP address. This may result in a policy exception, but may come with a different set of restrictions, e.g., "IP address may be used for detecting abuse. In such cases it will not be combined with account information." The entire process (Fig. 1, left panel) is highly manual, with each step sometimes taking weeks to identify the right people to talk to and multiple meetings between different groups (lawyers, champions, developers, auditors) that may as well be communicating in different languages.
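For concreteness, the octet-dropping mitigation the champion suggests can be sketched in a few lines of Python. The function name and the convention of zeroing rather than deleting the last octet are our own illustrative choices, not Bing's code:

```python
def truncate_ip(ip: str) -> str:
    """Drop the last octet of an IPv4 address, keeping a coarse
    location signal while discarding the full address."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"not a dotted-quad IPv4 address: {ip!r}")
    # Zero the final octet rather than removing it, so the result
    # still parses as an IP address downstream.
    return ".".join(octets[:3] + ["0"])

assert truncate_ip("203.0.113.57") == "203.0.113.0"
```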
Our central contribution is a workflow for privacy compliance in big data systems. Specifically, we target privacy compliance of large codebases written in languages that support the Map-Reduce programming model [7], [8], [9]. This focus enables us to apply our workflow to current industrial-scale data processing applications, in particular the data analytics backend of Bing, Microsoft's web search engine [10]. This workflow leverages our three key technical contributions: (1) a language, LEGALEASE, for stating privacy policies, which is usable by policy authors and privacy policy champions, but has precise semantics and enables automated checking for compliance; (2) a self-bootstrapping data inventory mapper, GROK, which maps low-level data types in code to high-level policy concepts, and bridges the world of product development with the world of policy makers; and (3) a scalable implementation of automated compliance checking for Bing. We describe each of these parts below.

The LEGALEASE language. LEGALEASE is a usable, expressive, and enforceable privacy policy language. The primary design criteria for this language were that it (a) be usable by the policy authors and privacy champions; (b) be expressive enough to capture real privacy policies of industrial-scale systems, e.g., Bing; and (c) allow compositional reasoning on policies.

As the intended users for LEGALEASE are policy authors and privacy champions with limited training in formal languages, enabling usability is essential. To this end, LEGALEASE enforces syntactic restrictions ensuring that encoded policy clauses are structured very similarly to policy texts. Specifically, building on prior work on a first-order privacy logic [11], policy clauses in LEGALEASE allow (resp. deny) certain types of information flows and are refined through exceptions that deny (resp. allow) some sub-types of the governed information flow types. This structure of nested allow-deny rules appears in many practical privacy policies, including privacy laws such as the Health Insurance Portability and Accountability Act (HIPAA) and the Gramm-Leach-Bliley Act (GLBA) (as observed in prior work [11]), as well as privacy policies for Bing and Google. A distinctive feature of LEGALEASE (and a point of contrast from prior work based on first-order logic and first-order temporal logic [12], [11]) is that the semantics of policies is compositional: reasoning about a policy is reduced to reasoning about its parts. This form of compositionality is useful because the effect of adding a new clause to a complex policy is locally contained (an exception only refines its immediately enclosing policy clause). Section III presents the detailed design of the language. To validate the usability of LEGALEASE by its intended users, we conduct a user study among policy writers and privacy champions within Microsoft. On the other hand, by encoding Bing's and Google's privacy policies regarding data usage on their servers, we demonstrate that LEGALEASE retains enough expressiveness to capture real privacy policies of industrial-scale systems. Section VI presents the results of the usability study and the encoding of Bing's and Google's privacy policies.

The GROK mapper. GROK is a data inventory for Map-Reduce-like big data systems. It maps every dynamic schema element (e.g., members of a tuple passed between mappers and reducers) to datatypes in LEGALEASE. This inventory can be viewed as a mechanism for annotating existing programs written in languages like Hive [7], Dremel [8], or Scope [9] with the information flow types (datatypes) in LEGALEASE. Our primary design criteria for this inventory were that it (a) be bootstrapped with minimal developer effort; (b) reflect exhaustive and up-to-date information about all data in the Map-Reduce-like system; and (c) make it easy to verify (and update) the mapping from schema elements to LEGALEASE datatypes. The inventory mappings combine information from a number of different sources, each of which has its own characteristic coverage and quality. For instance, syntactic analysis of source code (e.g., applying pattern-matching to column names) has high coverage but low confidence, whereas explicit annotations added by developers have high confidence but low coverage. Section IV details the design of the system, and Section V presents how the automated policy checker performs conservative analysis while minimizing false positives over imperfect mappings.
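To give a flavor of how such a mapping might combine a low-confidence syntactic source with high-confidence developer annotations, here is a minimal Python sketch; the patterns, datatype names, and confidence scores are invented for illustration and are not GROK's actual rules:

```python
import re

# Illustrative pattern -> (datatype, confidence) rules. Syntactic matches
# are high-coverage but low-confidence; an explicit developer annotation
# overrides them with confidence 1.0.
COLUMN_PATTERNS = [
    (re.compile(r"ip[_ ]?addr(ess)?", re.I), ("IPAddress", 0.6)),
    (re.compile(r"user[_ ]?id|uid", re.I), ("UniqueID", 0.5)),
    (re.compile(r"query|search[_ ]?terms", re.I), ("SearchQuery", 0.5)),
]

def guess_datatype(column_name, annotations=None):
    """Map a schema column to a policy datatype with a confidence score.

    `annotations` is an optional dict of developer-supplied labels,
    which take precedence over syntactic guesses."""
    if annotations and column_name in annotations:
        return annotations[column_name], 1.0
    for pattern, (datatype, confidence) in COLUMN_PATTERNS:
        if pattern.search(column_name):
            return datatype, confidence
    return None, 0.0

print(guess_datatype("client_ip_address"))            # ('IPAddress', 0.6)
print(guess_datatype("uid", {"uid": "AccountInfo"}))  # ('AccountInfo', 1.0)
```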

Fig. 2. Example scenario showing a partially-labeled data dependency graph between three files and programs.

By using automated data-inventory mapping and adding precise semantics to the policy specification, we reduce time-consuming meetings by decoupling the interactions between the various groups so policy specification, policy interpretation, product development, and continuous auditing can proceed in parallel. Since we use automation to bridge code-level details to policy concepts, meetings are needed only when our automated privacy compliance checker (conservatively) detects potentially sensitive scenarios, and are hence more focused, especially on actionable items (dotted lines in Fig. 1).

Scale. Our scalability criteria are (a) the amount of data over which we perform automated privacy compliance checking; (b) the time we take to do so; and (c) the number of people needed for the entire effort. As we quantify in Section VI, our deployed system scales to tens of millions of lines of source code written by several thousand developers storing data in tens of millions of files containing over a hundred million schema elements, of which a substantial fraction is changing or added on a day-to-day basis. Our data inventory takes twenty minutes (daily), and evaluating the complete LEGALEASE encoding of Bing's privacy policy over the entire data takes ten minutes. The entire effort was bootstrapped from scratch by a team of five people.

II. MOTIVATING EXAMPLE

In this section, we use an example to highlight salient features of our programming model and typical privacy policies that these programs have to respect. These features motivate the design of our privacy policy language LEGALEASE, described in Section III, and the data inventory GROK, described in Section IV, and provide intuition on how privacy compliance checking is reduced to a form of information flow analysis.

Consider the scenario in Fig. 2. There are three programs (Jobs 1, 2, 3) and three files (Files A, B, C). Let us assume that the programs are expected to be compliant with a privacy policy that says: "full IP address will not be used for advertising. IP address may be used for detecting abuse. In such cases it will not be combined with account information." Note that the policy restricts how a certain type of personal information flows through the system. The restriction in this example is based on purpose. Other common restrictions include storage restrictions (e.g., requiring that certain types of user data are not stored together) and, for internal policies, role-based restrictions (e.g., requiring that only specific product team members should use certain types of user data). While our policy language is designed in a general form enabling domain-specific instantiations with different kinds of restrictions, our evaluation of Bing is done with an instantiation that has exactly these three restrictions—purpose, role, and storage—on flow of various types of personal information. We interpret information flow in the sense of non-interference [13], i.e., data not supposed to flow to a program should not affect the output of the program.

The data dependence graph depicted for the example in Fig. 2 provides a useful starting point to conduct the information flow analysis. Nodes in the graph are data stores, processes, and humans. Directed edges represent data flowing from one node to another. To begin, let us assume that programs are labeled with their purpose. For example, Job 1 is for the purpose of AbuseDetect. Furthermore, let us also assume that the source data files are labeled with the type of data they hold. For example, File A holds data of type IPAddress. Given these labels, additional labels can be computed using a simple static dataflow analysis.
For example, Job 1 and Job 2 both acquire the datatype label IPAddress since they read File A; File C (and hence Job 3) acquires the datatype label IPAddress:Truncated. Given a labeled data dependence graph, a conservative way of checking non-interference is to check whether there exists a path from restricted data to the program in the data dependence graph. In a programming language such as C or Java, this approach may lead to unmanageable overtainting. Fortunately, the data analytics programs we analyze are written in a restricted programming model without global state and with very limited control flow based on data. Therefore, we follow precisely this approach. Languages like Hive [7], Dremel [8], or Scope [9] that are used to write big data pipelines in enterprises adhere to this programming model (Section IV provides additional details). Note that for the search engine that we analyze, the data dependence graph does not come with these kinds of labels. Bootstrapping these labels without significant human effort is a central challenge addressed by GROK (Section IV).
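The propagation-plus-reachability approach is easy to see in code. The following is a minimal Python sketch of the Fig. 2 scenario, not the deployed checker; in particular, the hand-coded `declassify` table stands in for GROK's recognition that Job 2's truncation changes the label of its output, and the purpose labels beyond Job 1's are assumed for illustration:

```python
from collections import defaultdict

# Data dependence graph for the Fig. 2 scenario: edges point from a
# node to the nodes that read its output.
edges = {
    "FileA": ["Job1", "Job2"],
    "Job2": ["FileC"],
    "FileC": ["Job3"],
}
source_labels = {"FileA": {"IPAddress"}}

# Job 2 truncates the IP address, so labels crossing its output edge
# are rewritten to the weaker sub-type.
declassify = {"Job2": {"IPAddress": "IPAddress:Truncated"}}

def propagate(edges, source_labels, declassify):
    """Forward dataflow to a fixed point: each node conservatively
    acquires the (possibly rewritten) labels of everything upstream."""
    labels = defaultdict(set, {n: set(ls) for n, ls in source_labels.items()})
    changed = True
    while changed:
        changed = False
        for src, dsts in edges.items():
            rewrite = declassify.get(src, {})
            out = {rewrite.get(l, l) for l in labels[src]}
            for dst in dsts:
                if not out <= labels[dst]:
                    labels[dst] |= out
                    changed = True
    return labels

labels = propagate(edges, source_labels, declassify)
assert labels["Job1"] == {"IPAddress"}            # reads File A directly
assert labels["Job3"] == {"IPAddress:Truncated"}  # sees only the truncated form

# Purpose labels (assumed here; the text only states Job 1's purpose).
purposes = {"Job1": "AbuseDetect", "Job3": "Advertising"}
violations = [n for n, p in purposes.items()
              if p == "Advertising" and "IPAddress" in labels[n]]
assert violations == []  # only the truncated IP ever reaches advertising
```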

III. POLICY SPECIFICATION LANGUAGE

We present the design goals for LEGALEASE, the language syntax and formal semantics, as well as a set of illustrative policy examples.

A. Design Goals

As mentioned, we intend legal teams and privacy champions to encode policy in LEGALEASE. Therefore, our primary goal is usability by individuals with typically no training in first-order or temporal logic, while being sufficiently expressive for encoding current policies.

a) Usability: Policy clauses in LEGALEASE are structured very similarly to clauses in the English-language policy. This correspondence is important because no single individual in a large company is responsible for all policy clauses; different sub-teams own different portions of the policy, and any mapping from LEGALEASE clauses to English clauses that does not fall along these organizational bounds would necessitate (time-consuming) processes to review and update the LEGALEASE clauses. By designing in a 1-1 correspondence to policies in English, LEGALEASE clauses can be added, reviewed, and updated at the same time as the corresponding English clauses and by the same individuals.

b) Expressivity: LEGALEASE clauses are built around an attribute abstraction (described below) that allows the language to evolve as policy evolves. For instance, policies today tend to focus on access control, retention times, and segregation of data in storage [14], [15], [16]. However, information flow properties [17] provide more meaningful restrictions on information use. Similarly, the externally-visible policy may be at a higher level while the internal policy may be more restrictive and nuanced. LEGALEASE allows transitioning between these policies with minimal policy authoring overhead, and provides enforcement techniques so as to enable stronger public-facing policy promises.

c) Compositional Reasoning: When the whole policy is stated as a monolithic logic formula, it may be more difficult to naturally reason about the effects of the policy, due to unexpected interactions between different parts of the formula [18]. LEGALEASE provides meaningful syntactic restrictions to allow compositional reasoning, where the result of checking the whole policy is a function of reasoning on its parts.

B. LEGALEASE Language Syntax

TABLE I
GRAMMAR FOR LEGALEASE

  Policy Clause  C ::= D | A
  Deny Clause    D ::= DENY T1 · · · Tn EXCEPT A1 · · · Am  |  DENY T1 · · · Tn
  Allow Clause   A ::= ALLOW T1 · · · Tn EXCEPT D1 · · · Dm  |  ALLOW T1 · · · Tn
  Attribute      T ::= ⟨attribute-name⟩ v1 · · · vl
  Value          v ::= ⟨attribute-value⟩

A LEGALEASE policy (Table I) is rooted in a single top-level policy clause. A policy clause is a layered collection of (alternating) ALLOW and DENY clauses where each clause relaxes or constricts the enclosing clause (i.e., each layer defines an exception to the enclosing layer). Each clause contains a set of domain-specific attributes that restrict to which data dependency graph nodes the policy applies. Attributes are specified by their name, and one or more values. Attribute values are picked from a concept lattice [19] for that attribute (explained below). The absence of an attribute implies that there are no restrictions for that attribute. The policy author defines new attributes by providing an attribute name and a lattice of values. In Section III-D, we describe the particular instantiation of attributes we use to specify information flow restrictions on programs.

Checking: LEGALEASE policies are checked at each node in the data dependency graph. Each graph node is labeled with the domain-specific attribute name and set of lattice values. For instance, in our setting, we assume that the data inventory phase labels programs with the data that flows to them, a purpose attribute, and a user attribute. Informally, an ALLOW clause permits graph nodes labeled with any subset of the attribute values listed in the clause, and a DENY clause forbids graph nodes labeled with any set that overlaps with the attribute values in the clause. The layering of clauses determines the context within which each clause is checked. We define the formal evaluation semantics in Section III-E.

C. LEGALEASE, by example

We illustrate LEGALEASE through a series of examples that build up to a complex clause.
In the examples we use two user-defined attributes: DataType and UseForPurpose (our deployment uses two additional ones, AccessByRole and InStore). We define the concept lattice for each of these four attributes in the next subsection.

The simplest LEGALEASE policy is DENY. The policy contains a single clause; the clause contains no exceptions and no attributes, and therefore denies everything.
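Building on the grammar of Table I and the informal checking rules above, the following Python sketch shows how nested allow-deny evaluation could work over a toy concept lattice. It is a simplification for intuition, not the formal semantics of Section III-E; the clause encoding, lattice contents, and helper names are all our own invention:

```python
# Toy concept lattice for DataType, as a child -> parent relation:
# IPAddress:Truncated sits below IPAddress, so a clause mentioning
# IPAddress also covers the truncated sub-type.
PARENT = {"IPAddress:Truncated": "IPAddress"}

def leq(v, u):
    """True if lattice value v is at or below value u."""
    while v is not None:
        if v == u:
            return True
        v = PARENT.get(v)
    return False

def covered(node_vals, clause_vals):
    # ALLOW: every node value must fall under some clause value.
    return all(any(leq(v, u) for u in clause_vals) for v in node_vals)

def overlapping(node_vals, clause_vals):
    # DENY: any node value falling under a clause value triggers the deny.
    return any(any(leq(v, u) for u in clause_vals) for v in node_vals)

def permits(clause, node):
    """clause = (kind, {attr: values}, [exception clauses]);
    node = {attr: values}. An attribute absent from a clause
    imposes no restriction, so a bare DENY denies everything."""
    kind, attrs, exceptions = clause
    if kind == "ALLOW":
        if not all(covered(node.get(a, set()), vs) for a, vs in attrs.items()):
            return False                    # outside what this ALLOW grants
        return all(permits(e, node) for e in exceptions)  # DENY exceptions
    else:  # DENY
        if not all(overlapping(node.get(a, set()), vs) for a, vs in attrs.items()):
            return True                     # the deny does not apply here
        return any(permits(e, node) for e in exceptions)  # ALLOW exceptions

# DENY DataType IPAddress, UseForPurpose Advertising
#   EXCEPT ALLOW DataType IPAddress:Truncated
policy = ("DENY",
          {"DataType": {"IPAddress"}, "UseForPurpose": {"Advertising"}},
          [("ALLOW", {"DataType": {"IPAddress:Truncated"}}, [])])

print(permits(policy, {"DataType": {"IPAddress"},
                       "UseForPurpose": {"Advertising"}}))           # False
print(permits(policy, {"DataType": {"IPAddress:Truncated"},
                       "UseForPurpose": {"Advertising"}}))           # True
```

Note how the EXCEPT clause carves the truncated sub-type back out of the denial: this is exactly the locally-contained refinement that the compositionality discussion above promises.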

