Fairness in Criminal Justice Risk Assessments: The State of the Art


Richard Berk [1,2], Hoda Heidari [3], Shahin Jabbari [3], Michael Kearns [3], and Aaron Roth [3]

Sociological Methods & Research, 1-42. © The Author(s) 2018. Reprints and permission: sagepub.com/journalsPermissions.nav

Abstract

Objectives: Discussions of fairness in criminal justice risk assessments typically lack conceptual precision. Rhetoric too often substitutes for careful analysis. In this article, we seek to clarify the trade-offs between different kinds of fairness and between fairness and accuracy. Methods: We draw on the existing literatures in criminology, computer science, and statistics to provide an integrated examination of fairness and accuracy in criminal justice risk assessments. We also provide an empirical illustration using data from arraignments. Results: We show that there are at least six kinds of fairness, some of which are incompatible with one another and with accuracy. Conclusions: Except in trivial cases, it is impossible to maximize accuracy and fairness at the same time and impossible simultaneously to satisfy all kinds of fairness. In practice, a major complication is different base rates across different legally protected groups. There is a need to consider challenging trade-offs. These lessons apply to applications well beyond criminology where assessments of risk can be used by decision makers. Examples include mortgage lending, employment, college admissions, child welfare, and medical diagnoses.

[1] Department of Statistics, University of Pennsylvania, Philadelphia, PA, USA
[2] Department of Criminology, University of Pennsylvania, Philadelphia, PA, USA
[3] Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA

Corresponding Author: Richard Berk, Department of Statistics, University of Pennsylvania, 483 McNeil, 3718 Locust Walk, Philadelphia, PA 19104, USA. Email: berkr@sas.upenn.edu

Keywords: risk assessment, machine learning, fairness, criminal justice, discrimination

The use of actuarial risk assessments in criminal justice settings has of late been subject to intense scrutiny. There have been ongoing discussions about how much better in practice risk assessments derived from machine learning perform compared to risk assessments derived from older, conventional methods (Berk 2012; Berk and Bleich 2013; Brennan and Oliver 2013; Liu et al. 2011; Rhodes 2013; Ridgeway 2013a, 2013b). We have learned that when relationships between predictors and the response are complex, machine learning approaches can perform far better. When relationships between predictors and the response are simple, machine learning approaches will perform about the same as conventional procedures.

Far less close to resolution are concerns about fairness raised by the media (Angwin et al. 2016; Cohen 2012; Crawford 2016; Dieterich et al. 2016; Doleac and Stevenson 2016), government agencies (National Science and Technology Council 2016:30-32), foundations (Pew Center of the States 2011), and academics (Berk 2008; Berk and Hyatt 2015; Demuth 2003; Hamilton 2016; Harcourt 2007; Hyatt, Chanenson, and Bergstrom 2011; Starr 2014b; Tonry 2014).[1] Even when direct indicators of protected group membership, such as race and gender, are not included as predictors, associations between these measures and legitimate predictors can "bake in" unfairness. An offender's prior criminal record, for example, can carry forward earlier, unjust treatment not just by criminal justice actors but by an array of other social institutions that may foster disadvantage.

As risk assessment critic Sonja Starr (2014a) writes,

    While well intentioned, this approach [actuarial risk assessment] is misguided. The United States inarguably has a mass-incarceration crisis, but it is poor people and minorities who bear its brunt. Punishment profiling will exacerbate these disparities—including racial disparities—because the risk assessments include many race-correlated variables. Profiling sends the toxic message that the state considers certain groups of people dangerous based on their identity. It also confirms the widespread impression that the criminal justice system is rigged against the poor. (p. A17)

On normative grounds, such concerns can be broadly legitimate, but without far more conceptual precision, it is difficult to reconcile competing claims and develop appropriate remedies. The debates can become rhetorical exercises, and few minds are changed.

This article builds on recent developments in computer science and statistics in which fitting procedures, often called algorithms, can assist criminal justice decision-making by addressing both accuracy and fairness.[2] Accuracy is formally defined by out-of-sample performance using one or more conceptions of prediction error (Hastie, Tibshirani, and Friedman 2009: section 7.2). There is no ambiguity. But, even when attempts are made to clarify what fairness can mean, there are several different kinds that can conflict with one another and with accuracy (Berk 2016b).

Examined here are different ways that fairness can be formally defined, how these different kinds of fairness can be incompatible, how risk assessment accuracy can be affected, and various algorithmic remedies that have been proposed. The perspectives represented are found primarily in statistics and computer science because those disciplines are the source of modern risk assessment tools used to inform criminal justice decisions.

No effort is made here to translate formal definitions of fairness into philosophical or jurisprudential notions, in part because the authors of this article lack the expertise and in part because that multidisciplinary conversation is just beginning (Barocas and Selbst 2016; Ferguson 2015; Janssen and Kuk 2016; Kroll et al. 2017). Nevertheless, an overall conclusion will be that you can't have it all. Rhetoric to the contrary, challenging trade-offs are required between different kinds of fairness and between fairness and accuracy.

Although for concreteness criminal justice applications are the focus, the issues readily generalize to a very wide range of risk assessment applications. For example, decisions made by banks about whom to grant mortgage loans rest heavily on risk assessments of the chances that the loan will be repaid. Employers commonly do background checks to help determine whether a job applicant will be a reliable employee. Child welfare agencies typically decide when a minor should be placed in foster care based in part on the predicted risk from remaining in their current residence.

Confusion Tables, Accuracy, and Fairness: A Prologue

For ease of exposition and with no important loss of generality, Y is the response variable, henceforth assumed to be binary, and there are two legally protected group categories: men and women. We begin by introducing by example some key ideas needed later to define fairness and accuracy. We build on the simple structure of a 2 x 2 cross-tabulation (Berk 2016b; Chouldechova 2017; Hardt, Price, and Srebro 2016). Illustrations follow shortly.

Table 1. A Cross-tabulation of the Actual Outcome by the Predicted Outcome When the Prediction Algorithm Is Applied to a Data Set.

Truth                    Failure Predicted        Success Predicted        Conditional Procedure Error
Failure (a positive)     a   true positives       b   false negatives      b/(a+b)   false negative rate
Success (a negative)     c   false positives      d   true negatives       c/(c+d)   false positive rate
Conditional use error    c/(a+c)   failure        b/(b+d)   success        (b+c)/(a+b+c+d)   overall
                         prediction error         prediction error         procedure error

Table 1 is a cross-tabulation of the actual binary outcome Y by the predicted binary outcome Ŷ. In machine learning, such tables are often called a "confusion table" (also a "confusion matrix"). Ŷ contains the fitted values that result when an algorithmic procedure is applied to the data. A "failure" is called a "positive" because it motivates the risk assessment; a positive might be an arrest for a violent crime. A "success" is a "negative," such as completing a probation sentence without any arrests. These designations are arbitrary but allow for a less abstract discussion.[3]

The left margin of the table shows the actual outcome classes. The top margin of the table shows the predicted outcome classes.[4] Cell counts internal to the table are denoted by letters. For example, "a" is the number of observations in the upper left cell. All counts in a particular cell have the same observed outcome class and the same predicted outcome class. For example, "a" is the number of observations for which the observed response class is a failure and the predicted response class is a failure; these are the true positives. Starting at the upper left cell and moving clockwise around the table are true positives, false negatives, true negatives, and false positives.

The cell counts and computed values on the margins of the table can be interpreted as descriptive statistics for the observed values and fitted values in the data on hand. It is also common to interpret the computed values on the margins of the table as estimates of the corresponding probabilities in a population. We turn to that later. For now, we just consider descriptive statistics.
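To make the bookkeeping in Table 1 concrete, here is a minimal sketch (ours, not the authors'; the function and variable names are hypothetical) that tallies the four cell counts from observed and predicted binary outcomes, following the convention above that a "failure" is the positive class.

```python
# Tally the four cells of Table 1 from observed (Y) and predicted (Yhat)
# binary outcomes. Following the text, a "failure" is the positive class,
# coded here as 1; a "success" is the negative class, coded as 0.

def confusion_cells(y_true, y_pred):
    """Return Table 1's cell counts (a, b, c, d)."""
    a = sum(1 for y, yhat in zip(y_true, y_pred) if y == 1 and yhat == 1)  # true positives
    b = sum(1 for y, yhat in zip(y_true, y_pred) if y == 1 and yhat == 0)  # false negatives
    c = sum(1 for y, yhat in zip(y_true, y_pred) if y == 0 and yhat == 1)  # false positives
    d = sum(1 for y, yhat in zip(y_true, y_pred) if y == 0 and yhat == 0)  # true negatives
    return a, b, c, d

# Toy example: five observed outcomes and the algorithm's fitted classes.
print(confusion_cells([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)
```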

There is a surprising amount of descriptive information that can be extracted from the table. We will use the following going forward.[5]

1. Sample size—The total number of observations, conventionally denoted by N: a + b + c + d.

2. Base rate—The proportion of actual failures, which is (a + b)/(a + b + c + d), or the proportion of actual successes, which is (c + d)/(a + b + c + d).

3. Prediction distribution—The proportion predicted to fail and the proportion predicted to succeed: (a + c)/(a + b + c + d) and (b + d)/(a + b + c + d), respectively.

4. Overall procedure error—The proportion of cases misclassified: (b + c)/(a + b + c + d).

5. Conditional procedure error—The proportion of cases incorrectly classified conditional on one of the two actual outcomes: b/(a + b), which is the false negative rate, and c/(c + d), which is the false positive rate.

6. Conditional use error—The proportion of cases incorrectly predicted conditional on one of the two predicted outcomes: c/(a + c), which is the proportion of incorrect failure predictions, and b/(b + d), which is the proportion of incorrect success predictions.[6] We use the term conditional use error because when risk is actually determined, the predicted outcome is employed; this is how risk assessments are used in the field.

7. Cost ratio—The ratio of false negatives to false positives, b/c, or the ratio of false positives to false negatives, c/b. When b and c are the same, the cost ratio is one, and false positives have the same weight as false negatives. If b is smaller than c, b is more costly. For example, if b = 20 and c = 60, false negatives are three times more costly than false positives. One false negative is "worth" three false positives. In practice, b can be more or less costly than c. It depends on the setting.

The discussion of fairness to follow uses all of these features of Table 1, although the particular features employed will vary with the kind of fairness. We will see, in addition, that the different kinds of fairness can be related to one another and to accuracy. But before getting into a more formal discussion, some common fairness issues will be illustrated with three hypothetical confusion tables.

Table 2 is a confusion table for a hypothetical set of women released on parole. Gender is the protected individual attribute. A failure on parole is a "positive," and a success on parole is a "negative." For ease of exposition, the counts are meant to produce a very simple set of results.

Table 2. Females: Fail or Succeed on Parole (Success Base Rate = 500/1,000 = .50, Cost Ratio = 200/200 = 1:1, and Predicted to Succeed = 500/1,000 = .50).

Truth                    Ŷ = fail                 Ŷ = succeed              Conditional Procedure Error
Y = fail (positive)      300   true positives     200   false negatives    .40   false negative rate
Y = succeed (negative)   200   false positives    300   true negatives     .40   false positive rate
Conditional use error    .40   failure            .40   success
                         prediction error         prediction error

The base rate for success is .50 because half of the women are not rearrested. The algorithm correctly predicts that the proportion who succeed on parole is .50. This is a favorable initial indication of the algorithm's performance because the marginal distribution of Y is the same as that of Ŷ.

Some call this "calibration" and assert that calibration is an essential feature of any risk assessment tool. Imagine the alternative: 70 percent of women on parole are arrest free, but the risk assessment projects that 50 percent will be arrest free. The instrument's credibility is immediately undermined. But calibration sets a very high standard that existing practice commonly will fail to meet. Do the decisions of police officers, judges, magistrates, and parole boards perform at the calibration standard? Perhaps a more reasonable standard is that any risk tool just needs to perform better than current practice. Calibration in practice is different from calibration in theory, although the latter is a foundation for much formal work on risk assessment fairness. We will return to these issues later.[7]

The false negative rate and false positive rate of .40 are the same for successes and failures. When the outcome is known, the algorithm can correctly identify it 60 percent of the time. Usually, the false positive rate and the false negative rate are different, which complicates overall performance assessments.

Because here the number of false negatives and false positives is the same (i.e., 200), the cost ratio is 1 to 1. This too is empirically atypical. False negatives and false positives are equally costly according to the algorithm. Usually, they are not.

The prediction error of .40 is the same for predicted successes and predicted failures. When the outcome is predicted, the prediction is correct 60 percent of the time. Usually, prediction error will differ between predicted successes and predicted failures.
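As a quick check on these figures, the following sketch (again ours, not part of the article) computes the summary measures 1 through 7 from Table 2's cell counts; it reproduces the base rate of .50, the error proportions of .40, and the cost ratio of 1 to 1.

```python
# Summary measures 1-7 applied to Table 2 (females):
# a = 300 true positives, b = 200 false negatives,
# c = 200 false positives, d = 300 true negatives.

def summary_measures(a, b, c, d):
    n = a + b + c + d
    return {
        "sample size":              n,             # measure 1
        "base rate (failure)":      (a + b) / n,   # measure 2
        "predicted to fail":        (a + c) / n,   # measure 3
        "predicted to succeed":     (b + d) / n,
        "overall procedure error":  (b + c) / n,   # measure 4
        "false negative rate":      b / (a + b),   # measure 5
        "false positive rate":      c / (c + d),
        "failure prediction error": c / (a + c),   # measure 6
        "success prediction error": b / (b + d),
        "cost ratio (b to c)":      b / c,         # measure 7
    }

for name, value in summary_measures(300, 200, 200, 300).items():
    print(f"{name}: {round(value, 2)}")
```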

Each of these measures can play a role in fairness assessments. We do not consider fairness yet because Table 2 shows only the results for women. Fairness is addressed across two or more confusion tables, one for each protected class.

Table 3 is a confusion table for a hypothetical set of men released on parole. To help illustrate fairness concerns, the base rate for success on parole is changed from .50 to .33. Men are substantially less likely to succeed on parole than women. The base rate was changed by multiplying the top row of cell counts in Table 2 by 2.0. That is the only change made to the cell counts. The bottom row of cell counts is unchanged.

Table 3. Males: Fail or Succeed on Parole (Success Base Rate = 500/1,500 = .33, Cost Ratio = 400/200 = 2:1, and Predicted to Succeed = 700/1,500 = .47).

Truth                    Ŷ = fail                 Ŷ = succeed              Conditional Procedure Error
Y = fail (positive)      600   true positives     400   false negatives    .40   false negative rate
Y = succeed (negative)   200   false positives    300   true negatives     .40   false positive rate
Conditional use error    .25   failure            .57   success
                         prediction error         prediction error

Although the proportion of women predicted to succeed on parole corresponds to the actual proportion of women who succeed, the proportion of men predicted to succeed on parole is a substantial overestimate of the actual proportion of men who succeed. For men, the distribution of Y is not the same as the distribution of Ŷ. There is a lack of calibration for men. Some might argue that this makes all the algorithmic results less defensible for men because an essential kind of accuracy has been sacrificed. (One would arrive at the same conclusion using predictions of failure on parole.) Fairness issues could arise in practice if decision makers, noting the disparity between the actual proportion who succeed on parole and the predicted proportion who succeed on parole, discount the predictions for men, implicitly introducing gender as an input to the decision to be made.

The false negative and false positive rates are the same and unchanged at .40. Just as for women, when the outcome is known, the algorithm can correctly identify it 60 percent of the time.

There are usually no fairness concerns when a confusion table measure being examined does not differ by protected class.

Failure prediction error is reduced from .40 to .25, and success prediction error is increased from .40 to .57. Men are more often predicted to succeed on parole when they actually do not. Women are more often predicted to fail on parole when they actually do not. If predictions of success on parole make a release more likely, some would argue that the prediction errors lead to decisions that unfairly favor men. Some would assert more generally that different prediction error proportions for men and women are by themselves a source of unfairness.

Whereas in Table 2, .50 of the women are predicted to succeed overall, in Table 3, .47 of the men are predicted to succeed overall. This is a small disparity in practice, but it favors women. If decisions are affected, some would call this unfair, but it is a different source of unfairness than disparate prediction errors by gender.

Finally, although the cost ratio in Table 2 for women makes false positives and false negatives equally costly (1 to 1), in Table 3, false positives are twice as costly as false negatives. Incorrectly classifying a success on parole as a failure is twice as costly for men (2 to 1). This too can be seen as unfair if it affects decisions. Put another way, individuals who succeed on parole but who would be predicted to fail are potentially of greater relative concern when the individual is a man.

It follows arithmetically that all of these potential unfairness and accuracy problems surface solely by changing the base rate, even when the false negative rate and false positive rate are unaffected (the sketch below makes the arithmetic explicit). Base rates can matter a great deal, a theme to which we will return. Base rates also matter substantially for a wide range of risk assessment settings such as those mentioned earlier. For example, diabetes base rates for Hispanics, blacks, and Native Americans can be as much as double the base rates for non-Hispanic whites (American Diabetes Association 2018). One consequence, other things equal, would be larger prediction errors for those groups when a diagnosis of diabetes is projected, implying a greater chance of false positives. The appropriateness of different medical interventions could be affected as a consequence.

We will see later that there are a number of proposals that try to correct for various kinds of unfairness, including those illustrated in the comparisons between Tables 2 and 3. For example, it is sometimes possible to tune classification procedures to reduce or even eliminate some forms of unfairness.
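Before turning to those proposals, here is the base-rate arithmetic as a minimal sketch (ours, not the authors'). The false negative and false positive rates are ratios within rows of the confusion table, so multiplying the top row by 2.0 leaves them untouched; the conditional use errors are ratios within columns, so they must shift.

```python
# Doubling the top row of Table 2 (300, 200 -> 600, 400) produces Table 3.
# Row-wise ratios (conditional procedure error) are invariant to scaling a
# row; column-wise ratios (conditional use error) are not.

def rates(a, b, c, d):
    return {
        "false negative rate":      b / (a + b),               # within top row
        "false positive rate":      c / (c + d),               # within bottom row
        "failure prediction error": c / (a + c),               # within left column
        "success prediction error": b / (b + d),               # within right column
        "predicted to succeed":     (b + d) / (a + b + c + d),
    }

women = rates(300, 200, 200, 300)  # Table 2
men   = rates(600, 400, 200, 300)  # Table 3: top row of Table 2 times 2
for key in women:
    print(f"{key}: women {women[key]:.2f}, men {men[key]:.2f}")
# Both procedure error rates stay at .40, but failure prediction error
# falls to .25 and success prediction error rises to .57 for men.
```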

Table 4. Males Tuned: Fail or Succeed on Parole (Success Base Rate = 500/1,500 = .33, Cost Ratio = 200/200 = 1:1, and Predicted to Succeed = 500/1,500 = .33).

Truth                    Ŷ = fail                 Ŷ = succeed              Conditional Procedure Error
Y = fail (positive)      800   true positives     200   false negatives    .20   false negative rate
Y = succeed (negative)   200   false positives    300   true negatives     .40   false positive rate
Conditional use error    .20   failure            .40   success
                         prediction error         prediction error

In Table 4, for example, the success base rate for men is still .33, but the cost ratio for men is tuned to be 1 to 1. Now, when success on parole is predicted, it is incorrect 40 times out of 100, which corresponds to the .40 success prediction error for women. When predicting success on parole, one has equal accuracy for men and women. A kind of unfairness has been eliminated. Moreover, the fraction of men predicted to succeed on parole now equals the actual fraction of men who succeed on parole. There is calibration for men. Some measure of credibility has been restored to their predictions. However, the false negative rate for men is now .20, not .40, as it is for women.
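A final sketch (ours; it verifies the published counts rather than reproducing the authors' tuning procedure) confirms what the tuning in Table 4 buys and what it costs: success prediction error and calibration now match across gender, while the false negative rates diverge.

```python
# Compare Table 2 (females) with Table 4 (males, tuned to a 1:1 cost ratio).

def measures(a, b, c, d):
    return (round(b / (a + b), 2),                 # false negative rate
            round(b / (b + d), 2),                 # success prediction error
            round((b + d) / (a + b + c + d), 2))   # proportion predicted to succeed

print(measures(300, 200, 200, 300))  # women:      (0.4, 0.4, 0.5)
print(measures(800, 200, 200, 300))  # men, tuned: (0.2, 0.4, 0.33)
# Success prediction error is .40 for both groups, and the male
# predicted-success proportion (.33) matches the male base rate, but the
# male false negative rate (.20) no longer matches the female rate (.40).
```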
