Introduction To Statistical Disclosure Control (SDC) - IHSN


IHSN
International Household Survey Network

Introduction to Statistical Disclosure Control (SDC)

Matthias Templ, Bernhard Meindl, Alexander Kowarik and Shuang Chen

IHSN Working Paper No 007
August 2014

www.ihsn.org


Acknowledgments

The authors benefited from the support and comments of Olivier Dupriez (World Bank), Matthew Welch (World Bank), François Fonteneau (OECD/PARIS21), Geoffrey Greenwell (OECD/PARIS21), Till Zbiranski (OECD/PARIS21) and Marie Anne Cagas (Consultant), as well as from the editorial support of Linda Klinger.

Dissemination and use of this Working Paper is encouraged. Reproduced copies may however not be used for commercial purposes.

This paper (or a revised copy of it) is available on the web site of the International Household Survey Network at www.ihsn.org.

Citation

Templ, Matthias, Bernhard Meindl, Alexander Kowarik, and Shuang Chen. “Introduction to Statistical Disclosure Control (SDC).” IHSN Working Paper No. 007 (2014).

The findings, interpretations, and views expressed in this paper are those of the author(s) and do not necessarily represent those of the International Household Survey Network member agencies or secretariat.

Table of Contents

1 Overview
  1.1 How to Use This Guide
2 Concepts
  2.1 What is Disclosure
  2.2 Classifying Variables
    2.2.1 Identifying variables
    2.2.2 Sensitive variables
    2.2.3 Categorical vs. continuous variables
  2.3 Disclosure Risk vs. Information Loss
3 Measuring Disclosure Risk
  3.1 Sample uniques, population uniques and record-level disclosure risk
  3.2 Principles of k-anonymity and l-diversity
  3.3 Disclosure risks for hierarchical data
  3.4 Measuring global risks
  3.5 Special Uniques Detection Algorithm (SUDA)
  3.6 Record Linkage
  3.7 Special Treatment of Outliers
4 Common SDC Methods
  4.1 Common SDC Methods for Categorical Variables
    4.1.1 Recoding
    4.1.2 Local suppression
    4.1.3 Post-Randomization Method (PRAM)
  4.2 Common SDC Methods for Continuous Variables
    4.2.1 Micro-aggregation
    4.2.2 Adding noise
    4.2.3 Shuffling
5 Measuring Information Loss
  5.1 Direct Measures
  5.2 Benchmarking Indicators
6 Practical Guidelines
  6.1 How to Determine Key Variables
  6.2 What is an Acceptable Level of Disclosure Risk versus Information Loss
  6.3 Which SDC Methods Should be Used
7 An Example Using SES Data
  7.1 Determine Key Variables
  7.2 Risk Assessment for Categorical Key Variables
  7.3 SDC of Categorical Key Variables
  7.4 SDC of Continuous Key Variables
  7.5 Assess Information Loss with Benchmarking Indicators
Acronyms
References

List of tables

Table 1: Example of frequency count, sample uniques and record-level disclosure risks estimated with a Negative Binomial model
Table 2: Example inpatient records illustrating k-anonymity and l-diversity
Table 3: Example dataset illustrating SUDA scores
Table 4: Example of micro-aggregation: var1, var2, var3 are key variables containing original values. var1’, var2’, var3’ contain values after applying micro-aggregation.

List of figures

Figure 1: Disclosure risk versus information loss obtained from two specific SDC methods applied to the SES data
Figure 2: A workflow for applying common SDC methods to microdata
Figure 3: Comparing SDC methods by regression coefficients and confidence intervals estimated using the original estimates (in black) and perturbed data (in grey)

Listing

Listing 1: Record-level and global risk assessment measures of the original SES data
Listing 2: Frequency calculation after recoding
Listing 3: Disclosure risks and information lost after applying microaggregation (MDAV, k-3) to continuous key variables

1. Overview

To support research and policymaking, there is an increasing demand for microdata. Microdata are data that hold information collected on individual units, such as people, households or enterprises. For statistical producers, microdata dissemination increases returns on data collection and helps improve data quality and credibility. But statistical producers are also faced with the challenge of ensuring respondents’ confidentiality while making microdata files more accessible. Not only are data producers obligated to protect confidentiality, but security is also crucial for maintaining the trust of respondents and ensuring the honesty and validity of their responses.

Proper and secure microdata dissemination requires statistical agencies to establish policies and procedures that formally define the conditions for accessing microdata (Dupriez and Boyko, 2010), and to apply statistical disclosure control (SDC) methods to data before release. This guide, Introduction to Statistical Disclosure Control (SDC), discusses common SDC methods for microdata obtained from sample surveys, censuses and administrative sources.

1.1 How to Use This Guide

This guide is intended for statistical producers at National Statistical Offices (NSOs) and other statistical agencies, as well as data users who are interested in the subject. It assumes no prior knowledge of SDC. The guide is focused on SDC methods for microdata. It does not cover SDC methods for protecting tabular outputs (see Castro 2010 for more details).

The guide starts with an introduction to the basic concepts regarding statistical disclosure in Section 2. Section 3 discusses methods for measuring disclosure risks. Section 4 presents the most common SDC methods, followed by an introduction to common approaches for assessing information loss and data utility in Section 5. Section 6 provides practical guidelines on how to implement SDC. Section 7 uses a sample dataset to illustrate the primary concepts and procedures introduced in this guide.

All the methods introduced in this guide can be implemented using sdcMicroGUI, an R-based, user-friendly application (Kowarik et al., 2013) and/or the more advanced R package, sdcMicro (Templ et al., 2013). Readers are encouraged to explore them using this guide along with the detailed user manuals of sdcMicroGUI (Templ et al., 2014b) and sdcMicro (Templ et al., 2013). Additional case studies of how to implement SDC on specific datasets are also available; see Templ et al. 2014a.

2. Concepts

This section introduces the basic concepts related to statistical disclosure, SDC methods and the trade-off between disclosure risks and information loss.

2.1 What is Disclosure

Suppose a hypothetical intruder has access to some released microdata and attempts to identify or find out more information about a particular respondent. Disclosure, also known as “re-identification,” occurs when the intruder reveals previously unknown information about a respondent by using the released data. Three types of disclosure are noted here (Lambert, 1993):

- Identity disclosure occurs if the intruder associates a known individual with a released data record. For example, the intruder links a released data record with external information, or identifies a respondent with extreme data values. In this case, an intruder can exploit a small subset of variables to make the linkage, and once the linkage is successful, the intruder has access to all other information in the released data related to the specific respondent.

- Attribute disclosure occurs if the intruder is able to determine some new characteristics of an individual based on the information available in the released data. For example, if a hospital publishes data showing that all female patients aged 56 to 60 have cancer, an intruder then knows the medical condition of any female patient aged 56 to 60 without having to identify the specific individual.

- Inferential disclosure occurs if the intruder is able to determine the value of some characteristic of an individual more accurately with the released data than otherwise would have been possible. For example, with a highly predictive regression model, an intruder may be able to infer a respondent’s sensitive income information using attributes recorded in the data, leading to inferential disclosure.

2.2 Classifying Variables

2.2.1 Identifying variables

SDC methods are often applied to identifying variables whose values might lead to re-identification. Identifying variables can be further classified into direct identifiers and key variables:

- Direct identifiers are variables that unambiguously identify statistical units, such as social insurance numbers, or names and addresses of companies or persons. Direct identifiers should be removed as the first step of SDC.

- Key variables are a set of variables that, in combination, can be linked to external information to re-identify respondents in the released dataset. Key variables are also called “implicit identifiers” or “quasi-identifiers”. For example, on their own, the gender, age, region and occupation variables may not reveal the identity of any respondent, but in combination they may uniquely identify respondents.

2.2.2 Sensitive variables

SDC methods are also applied to sensitive variables to protect confidential information of respondents. Sensitive variables are those whose values must not be discovered for any respondent in the dataset. The determination of sensitive variables is often subject to legal and ethical concerns. For example, variables containing information on criminal history, sexual behavior, medical records or income are often considered sensitive. In some cases, even if identity disclosure is prevented, releasing sensitive variables can still lead to attribute disclosure (see example in Section 3.2).

A variable can be both identifying and sensitive. For example, an income variable can be combined with other key variables to re-identify respondents, but the variable itself also contains sensitive information that should be kept confidential. On the other hand, some variables, such as occupation, might not be sensitive, but could be used to re-identify respondents when combined with other variables. In this case, occupation is a key variable, and SDC methods should be applied to it to prevent identity disclosure.

2.2.3 Categorical vs. continuous variables

SDC methods differ for categorical variables and continuous variables. Using the definitions in Domingo-Ferrer and Torra (2005), a categorical variable takes values over a finite set. For example, gender is a categorical variable. A continuous variable is numerical, and arithmetic operations can be performed with it. For example, income and age are continuous variables. A numerical variable does not necessarily have an infinite range, as in the case of age.

2.3 Disclosure Risk vs. Information Loss

Applying SDC techniques to the original microdata may result in information loss and hence affect data utility (the value of data as an analytical resource, comprising analytical completeness and analytical validity). The main challenge for a statistical agency, therefore, is to apply the optimal SDC techniques that reduce disclosure risks with minimal information loss, preserving data utility. To illustrate the trade-off between disclosure risk and information loss, Figure 1 shows a general example of results after applying two different SDC methods to the European Union Structure of Earnings Statistics (SES) data (Templ et al., 2014a). The specific SDC methods and measures of disclosure risk and information loss will be explained in the following sections.

Before applying any SDC methods, the original data is assumed to have disclosure risk of 1 and information loss of 0. As shown in Figure 1, two different SDC methods are applied to the same dataset. The solid curve represents the first SDC method (i.e., adding noise; see Section 4.2.2). The curve illustrates that, as more noise is added to the original data, the disclosure risk decreases but the extent of information loss increases.

In comparison, the dotted curve, illustrating the result of the second SDC method (i.e., micro-aggregation; see Section 4.2.1), is much less steep than the solid curve representing the first method. In other words, at a given level of disclosure risk (for example, when disclosure risk is 0.1), the information loss resulting from the second method is much lower than that resulting from the first.

Therefore, for this specific dataset, Method 2 is the preferred SDC method for the statistical agency to reduce disclosure risk with minimal information loss. In Section 6, we will discuss in detail how to determine the acceptable levels of risk and information loss in practice.

[Figure: information loss plotted against disclosure risk for method1 and method2]
Figure 1: Disclosure risk versus information loss obtained from two specific SDC methods applied to the SES data

3. Measuring Disclosure Risk

Disclosure risks are defined based on assumptions of disclosure scenarios, that is, how the intruder might exploit the released data to reveal information about a respondent. For example, an intruder might achieve this by linking the released file with another data source that shares the same respondents and identifying variables. In another scenario, if an intruder knows that his/her acquaintance participated in the survey, he/she may be able to match his/her personal knowledge with the released data to learn new information about the acquaintance. In practice, most of the measures for assessing disclosure risks, as introduced below, are based on key variables, which are determined according to assumed disclosure scenarios.

3.1 Sample uniques, population uniques and record-level disclosure risk

Disclosure risks of categorical variables are defined based on the idea that records with unique combinations of key variable values have higher risks of re-identification (Skinner and Holmes, 1998; Elamir and Skinner, 2006). We call a combination of values of an assumed set of key variables a pattern, or key value. Let $f_k$ be the frequency count of records with pattern $k$ in the sample. A record is called a sample unique if it has a pattern $k$ for which $f_k = 1$. Let $F_k$ denote the number of units in the population having the same pattern $k$. A record is called a population unique if $F_k = 1$.

In Table 1, a very simple dataset is used to illustrate the concept of sample frequency counts and sample uniques. The sample dataset has eight records and four pre-determined key variables (i.e., Age group, Gender, Income and Education). Given the four key variables, we have six distinct patterns, or key values. The sample frequency counts of the first and second records equal 2 because the two records share the same pattern (i.e., {20s, Male, >50k, High school}). Record 5 is a sample unique because it is the only individual in the sample who is a female in her thirties earning less than 50k with a university degree. Similarly, records 6, 7 and 8 are sample uniques, because they possess distinct patterns with respect to the four key variables.

Table 1: Example of frequency count, sample uniques and record-level disclosure risks estimated with a Negative Binomial model

| Record | Age group | Gender | Income | Education | $f_k$ | Sampling weight | Risk |
|---|---|---|---|---|---|---|---|
| 1 | 20s | Male | >50k | High school | 2 | 18 | 0.017 |
| 2 | 20s | Male | >50k | High school | 2 | 92 | 0.017 |
| 3 | 20s | Male | ≤50k | High school | 2 | 45.5 | 0.022 |
| 4 | 20s | Male | ≤50k | High school | 2 | 39 | 0.022 |
| 5 | 30s | Female | ≤50k | University | 1 | 17 | 0.177 |
| 6 | 40s | Female | ≤50k | High school | 1 | 8 | 0.297 |
| 7 | 40s | Female | ≤50k | Middle school | 1 | 541 | 0.012 |
| 8 | 60s | Male | ≤50k | University | 1 | 5 | 0.402 |

Consider a sample unique with $f_k = 1$. Assuming no measurement error, there are $F_k$ units in the population that could potentially match the record in the sample. The probability that the intruder can match the sample unique with the individual in the population is thus $1/F_k$, assuming that the intruder does not know if the individual in the population is a respondent in the sample or not. The disclosure risk for the sample unique is thus defined as the expected value of $1/F_k$, given $f_k = 1$.
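The sample frequency counts and sample uniques described for Table 1 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the income labels `>50k` and `<=50k` are stand-ins for the table's two income categories (which of the first two record pairs falls in which category is an assumption that does not change the counts).

```python
from collections import Counter

# Key-variable patterns of the eight records in Table 1:
# (age group, gender, income, education).
# The income labels are illustrative stand-ins for the two income categories.
records = [
    ("20s", "Male", ">50k", "High school"),
    ("20s", "Male", ">50k", "High school"),
    ("20s", "Male", "<=50k", "High school"),
    ("20s", "Male", "<=50k", "High school"),
    ("30s", "Female", "<=50k", "University"),
    ("40s", "Female", "<=50k", "High school"),
    ("40s", "Female", "<=50k", "Middle school"),
    ("60s", "Male", "<=50k", "University"),
]


def sample_frequencies(rows):
    """Return f_k for every record: how many records share its pattern."""
    counts = Counter(rows)
    return [counts[row] for row in rows]


def sample_uniques(rows):
    """Return the 1-based numbers of records whose pattern occurs exactly once."""
    return [i for i, f in enumerate(sample_frequencies(rows), start=1) if f == 1]


print(sample_frequencies(records))  # [2, 2, 2, 2, 1, 1, 1, 1]
print(sample_uniques(records))      # [5, 6, 7, 8]
print(len(set(records)))            # 6 distinct patterns
```

Counting whole patterns rather than single variables is the point of the exercise: no individual value in record 6 is unique, yet the combination of its four key variables is.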

More generally, the record-level disclosure risk for any given record is defined as the expected value of $1/F_k$, given $f_k$; for a sample unique, this is the expected value of $1/F_k$ given $f_k = 1$.

In practice, we observe only the sample frequency counts $f_k$. To estimate the record-level disclosure risks, we take into account the sampling scheme and make inferences on $F_k$, assuming that $F_k$ follows a generalized Negative Binomial distribution (Rinott and Shlomo, 2006; Franconi and Polettini, 2004).

3.2 Principles of k-anonymity and l-diversity

Assuming that sample uniques are more likely to be re-identified, one way to protect confidentiality is to ensure that each distinct pattern of key variables is possessed by at least k records in the sample. This approach is called achieving k-anonymity (Samarati and Sweeney, 1998; Sweeney, 2002). A typical practice is to set $k = 3$, which ensures that the same pattern of key variables is possessed by at least three records in the sample. Using the previous notation, 3-anonymity means $f_k \geq 3$ for all records. By this definition, all records in the previous example (Table 1) violate 3-anonymity.

Even if a group of observations fulfills k-anonymity, an intruder can still discover sensitive information. For example, Table 2 satisfies 3-anonymity, given the two key variables gender and age group. However, suppose an intruder gets access to the sample inpatient records, and knows that his neighbor, a woman in her twenties, recently went to the hospital. Since all records of females in their twenties have the same medical condition, the intruder discovers with certainty that his neighbor has cancer. In a different scenario, if the intruder has a male friend in his thirties who belongs to one of the first three records, and the intruder knows that his friend is unlikely to have heart disease, he concludes that his friend has cancer.

Table 2: Example inpatient records illustrating k-anonymity and l-diversity

| Record | Gender (key variable) | Age group (key variable) | $f_k$ | Medical condition (sensitive variable) | Distinct l-diversity |
|---|---|---|---|---|---|
| 1 | Male | 30s | 3 | Cancer | 2 |
| 2 | Male | 30s | 3 | Heart Disease | 2 |
| 3 | Male | 30s | 3 | Heart Disease | 2 |
| 4 | Female | 20s | 3 | Cancer | 1 |
| 5 | Female | 20s | 3 | Cancer | 1 |
| 6 | Female | 20s | 3 | Cancer | 1 |

To address this limitation of k-anonymity, the l-diversity principle (Machanavajjhala et al., 2007) was introduced as a stronger notion of privacy: A group of observations with the same pattern of key variables is l-diverse if it contains at least l “well-represented” values for the sensitive variable. Machanavajjhala et al. (2007) interpreted “well-represented” in a number of ways, and the simplest interpretation, distinct l-diversity, ensures that the sensitive variable has at least l distinct values for each group of observations with the same pattern of key variables. As shown in Table 2, the first three records are 2-diverse because they have two distinct values for the sensitive variable, medical condition.

3.3 Disclosure risks for hierarchical data

Many micro-datasets have hierarchical, or multilevel, structures; for example, individuals are situated in households. Once an individual is re-identified, the data intruder may learn information about the other household members, too. It is important, therefore, to take into account the hierarchical structure of the dataset when measuring disclosure risks.

It is commonly assumed that the disclosure risk for a household is greater than or equal to the risk that at least one member of the household is re-identified. Thus household-level disclosure risks can be estimated as one minus the probability that no person from the household is re-identified. For example, if we consider a single household of three members, whose individual disclosure risks are 0.1, 0.05 and 0.01, respectively, the disclosure risk for the entire household will be calculated as $1 - (1 - 0.1) \times (1 - 0.05) \times (1 - 0.01) = 0.15355$.

3.4 Measuring global risks

In addition to record-level disclosure risk measures, a risk measure for the entire micro-dataset (file-level or global risk) might be of interest. In this section, we present three common measures of global risks:

- Expected number of re-identifications. The easiest measure of global risk is to sum up the record-level disclosure risks (defined in Section 3.1), which gives the expected number of re-identifications. Using the example from Table 1, the expected number of re-identifications is 0.966, the sum of the last column.

- Global risk measure based on log-linear models. This measure is defined as the number of sample uniques that are also population uniques.

  The number of sample uniques that are also population uniques is estimated using standard log-linear models (Skinner and Holmes, 1998; Ichim, 2008). The population frequency counts, or the number of units in the population that possess a specific pattern of key variables observed in the sample, are assumed to follow a Poisson distribution. The global risk can then be estimated by a standard log-linear model, using the main effects and interactions of key variables. A more precise definition is available in Skinner and Holmes (1998).

- Benchmark approach. This measure counts the number of observations with record-level risks higher than a certain threshold and higher than the main part of the data. While the previous two measures indicate an overall re-identification risk for a microdata file, the benchmark approach is a relative measure that examines whether the distribution of record-level risks contains extreme values. For example, we can identify the number of records whose individual risk exceeds both a fixed threshold and a cutoff derived from MAD($r$), where $r$ represents all record-level risks, and MAD($r$) is the median absolute deviation of all record-level risks.

3.5 Special Uniques Detection Algorithm (SUDA)

An alternative approach to defining disclosure risks is based on the concept of special uniqueness. For example, the eighth record in Table 1 is a sample unique with respect to the key variable set {Age group, Gender, Income, Education}. Furthermore, a subset of the key variable set, for example, {Male, University}, is also unique in the sample. A record is defined as a special unique with respect to a variable set K if it is a sample unique both on K and on a subset of K (Elliot et al., 1998). Research has shown that special uniques are more likely to be population uniques than random uniques (Elliot et al., 2002).

A set of computer algorithms, called SUDA, was designed to comprehensively detect and grade special uniques (Elliot et al., 2002). SUDA takes a two-step approach. In the first step, all unique attribute sets (up to a user-specified size) are located at record level. To streamline the search process, SUDA considers only Minimal Sample Uniques (MSUs), which are unique attribute sets without any unique subsets within a sample. In the example presented in Table 3, {Male, University} is an MSU of record 8 because none of its subsets, {Male} or {University}, is unique in the sample. In contrast, {60s, Male, ≤50k, University} is a unique attribute set but not an MSU, because its subsets {60s, Male, University} and {Male, University} are both unique in the sample.

Once all MSUs have been found, a SUDA score is assigned to each record indicating how “risky” it is, using the size and distribution of MSUs within each record (Elliot et al., 2002). The potential risk of the records is determined based on two observations: 1) the smaller the size of the MSU within a record, the greater the risk of the record, and 2) the larger the number of MSUs possessed by the record, the greater the risk of the record.

For each MSU of size k contained in a given record, a score is computed by $\prod_{i=k}^{M} (ATT - i)$, where M is the user-specified maximum size of MSUs, and ATT is the total number of attributes in the dataset. By definition, the smaller the size k of the MSU, the larger the score for the MSU.

The final SUDA score for the record is computed by adding the scores for each MSU. In this way, records with more MSUs are assigned a higher SUDA score.

To illustrate how SUDA scores are calculated, record 8 in Table 3 has two MSUs: {60s} of size 1, and {Male, University} of size 2. Suppose the maximum size of MSUs is set at 3. The score assigned to {60s} is then computed by $\prod_{i=1}^{3} (4 - i) = 6$, and the score assigned to {Male, University} is $\prod_{i=2}^{3} (4 - i) = 2$. The SUDA score for the eighth record in Table 3 is then 8.

Table 3: Example dataset illustrating SUDA scores

| Record | Age group | Gender | Income | Education | $f_k$ | SUDA score | Risk using DIS-SUDA method |
|---|---|---|---|---|---|---|---|
| 1 | 20s | Male | >50k | High school | 2 | 0 | 0.00 |
| 2 | 20s | Male | >50k | High school | 2 | 0 | 0.00 |
| 3 | 20s | Male | ≤50k | High school | 2 | 0 | 0.00 |
| 4 | 20s | Male | ≤50k | High school | 2 | 0 | 0.00 |
| 5 | 30s | Female | ≤50k | University | 1 | 8 | 0.0149 |
| 6 | 40s | Female | ≤50k | High school | 1 | 4 | 0.0111 |
| 7 | 40s | Female | ≤50k | Middle school | 1 | 6 | 0.0057 |
| 8 | 60s | Male | ≤50k | University | 1 | 8 | 0.0149 |
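The MSU search and the scoring rule can be illustrated with a brute-force sketch in Python. It is not the SUDA implementation itself, which uses a far more efficient search; the function names are hypothetical, the income labels are stand-ins for the two income categories, and the per-MSU score is taken to be $\prod_{i=k}^{M} (ATT - i)$, which gives the scores of 6 and 2 that add up to the SUDA score of 8 quoted for record 8.

```python
from itertools import combinations
from math import prod

# Records of Table 3: (age group, gender, income, education).
# The income labels are illustrative stand-ins for the two income categories.
records = [
    ("20s", "Male", ">50k", "High school"),
    ("20s", "Male", ">50k", "High school"),
    ("20s", "Male", "<=50k", "High school"),
    ("20s", "Male", "<=50k", "High school"),
    ("30s", "Female", "<=50k", "University"),
    ("40s", "Female", "<=50k", "High school"),
    ("40s", "Female", "<=50k", "Middle school"),
    ("60s", "Male", "<=50k", "University"),
]


def minimal_sample_uniques(rows, index, max_size):
    """Return the MSUs of rows[index] as tuples of column positions."""
    target = rows[index]
    msus = []
    # Sizes are visited in increasing order, so every smaller MSU is already
    # known when a larger attribute set is examined.
    for size in range(1, max_size + 1):
        for cols in combinations(range(len(target)), size):
            # A superset of an MSU is unique but not minimal: skip it.
            if any(set(m) <= set(cols) for m in msus):
                continue
            matches = sum(all(row[c] == target[c] for c in cols) for row in rows)
            if matches == 1:
                msus.append(cols)
    return msus


def suda_score(rows, index, max_size):
    """Sum, over the record's MSUs of size k, the product of (ATT - i) for i = k..M."""
    att = len(rows[index])
    return sum(
        prod(att - i for i in range(len(msu), max_size + 1))
        for msu in minimal_sample_uniques(rows, index, max_size)
    )


print(minimal_sample_uniques(records, 7, 3))           # [(0,), (1, 3)]
print([suda_score(records, i, 3) for i in range(8)])   # [0, 0, 0, 0, 8, 4, 6, 8]
```

For record 8 (index 7), the two MSUs are column 0 (age group, {60s}) and columns 1 and 3 (gender and education, {Male, University}). The brute-force search is exponential in the number of key variables, so it is only suitable for small illustrations like this one.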

In order to estimate record-level disclosure risks, SUDA scores can be used in combination with the Data Intrusion Simulation (DIS) metric (Elliot and Manning, 2003), a method for assessing disclosure risks for the entire dataset (i.e., file-level disclosure risks). Roughly speaking, the DIS-SUDA method distributes the file-level risk measure generated by the DIS metric between records according to the SUDA scores of each record. This way, SUDA scores are calibrated against a consistent measure to produce the DIS-SUDA scores, which provide the record-level disclosure risk. A full description of the DIS-SUDA method is provided by Elliot and Manning (2003).

3.6 Record Linkage

... what extent records in the perturbed data file can be correctly matched with those in the original data file. There are three general approaches to record linkage:

- Distance-based record linkage (Pagliuca and Seri, 1999) computes distances between records in the original dataset and the protected dataset. Suppose we have obtained a protected dataset A’ after applying some SDC methods to the original dataset A. For each record r in the protected dataset A’, we compute its distance to every record in the original dataset, and consider the nearest and the second nearest records. Suppose we have identified r1 and r2 from the original dataset as the nearest and second-nearest records, respectively, to record r. If r1 is the original record used to generate r, or, in other words,
