Introduction: A Puzzling Disparity
Over the years, software failures have caused hundreds of deaths and injuries, along with hundreds of billions of dollars in damages. A fairly recent exhibit is the MCAS software of the Boeing 737 Max, a chief cause behind the deaths of 346 people (Lion Air Flight 610 and Ethiopian Airlines Flight 302),1 but the full list is depressingly long and spans several decades. The Northeast Blackout of 2003, caused by a software bug (a race condition) in an alarm system, affected 50 million people in the northeastern US and parts of Canada and contributed to nearly 100 deaths. In 1996, a floating-point overflow error due to a single line of Ada code caused the Ariane 5 rocket to self-destruct, a disaster that cost roughly $370 million. In 1999, losing the Mars Climate Orbiter due to a software bug involving unit conversions cost the taxpayers $551 million (in 2022 dollars). In the 1980s, buggy code in the Therac-25 radiation-therapy machine caused several patient deaths. In 1991, during the Gulf War, a timing bug in the control software of a Patriot missile battery prevented it from intercepting an incoming Scud missile, which killed 28 American soldiers and injured about one hundred others. In 2019, a firmware bug caused Lime’s electric scooters to brake suddenly at high speeds, implicating them in hundreds of injuries involving broken collarbones and jaws, “among other serious injuries”.
In 1983, software bugs in Soviet missile detection systems nearly caused a nuclear war when the systems mistakenly reported that five American ICBMs were headed for the Soviet Union; global catastrophe was averted only thanks to the intervention of a lieutenant colonel in the Soviet Air Defense Forces, Stanislav Petrov, who suspected computer error and refrained from following protocol and calling his superiors to report a nuclear strike, which could well have triggered a massive retaliatory missile attack against the U.S. In 2012, a bug in Knight Capital Group’s software (a bug that “happened to be a very large software bug”, according to Knight Capital’s CEO) caused its trading system to go into a frenetic spree of buying and selling stocks that managed to rack up losses of $440 million in less than one hour. Toyota has recalled millions of vehicles since 2009 over unintended acceleration problems due to software bugs, leading to thousands of lawsuits for personal injury or wrongful death; already in 2010 the New York Times reported that at least 93 deaths had been linked to that issue, with hundreds of injuries. Software errors have also long been cited as the chief cause of medical device recalls, of which there are dozens every year, with many implicated in wrongful injuries and/or deaths. The list goes on. It is estimated that “the cost of poor software quality in the US” comes to “at least $2.41 trillion”—a staggering figure by anyone’s lights.
Yet somehow there has never been a collective outcry from pundits and opinion-shaping elites calling for the regulation of software with a view to averting similar disasters. In sharp contrast, the last few years have witnessed an increasingly loud chorus of voices clamoring for the regulation of AI and computational decision making in general, for a hodgepodge of proffered reasons, many of them stemming from ethical concerns about algorithmic bias and discrimination, but also increasingly reflecting fears about nefarious uses of AI, such as the large-scale production and dissemination of disinformation aimed at destabilizing democracies. As one of many examples, just this last October, 86 human and civil rights organizations, including the ACLU, Amnesty International, NAACP, and Human Rights Watch, urged Congress to take action “on the significant human rights and societal risks created and enabled by artificial intelligence (AI) technologies.”
What accounts for the disparity? Looking at the tally, on one side we have a long and meticulously documented trail of grave destruction, both physical and economic. On the other side, starting with the generative-AI camp, we have systems like ChatGPT and Gemini, whose greatest offenses to date are occasional fabricated facts, as when ChatGPT hallucinated false information about a radio host in Georgia, who promptly sued OpenAI for reputational damage, or when Gemini generated images of Black Vikings. There are also concerns about intellectual property and copyright, expressed in lawsuits by writers such as Paul Tremblay and Mona Awad, comedians and authors like Sarah Silverman (most of her suit was thrown out by courts in November), and more recently outfits like the New York Times, all of which are in uncharted legal territory and have not had any clearly identifiable victims; along with a host of other concerns that I will touch on in what follows, such as the use of LLMs to produce misinformation at scale and the malicious use of AI for crimes such as phishing and hacking (e.g., AI-assisted password cracking), and potentially even for biological or nuclear terrorism. Most of these concerns have remained speculative and have not had any measurable impact so far.2
Much apprehension is also being expressed about non-generative AI systems, which are primarily designed for analysis rather than generation of new content. (The main examples here are classifiers and regressors, but the category is broad and includes recommendation engines, rankers, and anomaly detection systems, among others.) The outputs produced by non-generative AI systems, whether category labels from a classifier or numbers from a regression model, provide insights about the input data and are often interpreted as predictions that are used to inform decision making. The following list is a representative, though not exhaustive, sample of the concerns that are being raised on this front:
Concerns about workplace discrimination stemming from the use of such models in hiring, promotion decisions, and so on.
Concerns about biased credit or mortgage decisions by financial institutions, such as those voiced in the Apple credit card incident described below.
Anxieties over predictive policing and the use of algorithms in law enforcement in general, particularly algorithmic risk assessments in criminal sentencing (as in the infamous COMPAS saga, which will be discussed in detail later).
Concerns about facial recognition technology.
These concerns should not be trivialized. I will be engaging with them earnestly and in depth. But their import must be put into perspective. The societal damage inflicted by AI so far does not even begin to approach the damage that has come about courtesy of dysfunctional software and plain poor software engineering practices. A lot more will be said in due course about all of the issues listed above, but, as a preview, let's briefly consider a couple of them here, starting with automated decisions about credit and lending.
Let’s take as an example a widely publicized incident from 2019 involving the Apple credit card, which is underwritten by Goldman Sachs (although the relationship between the two companies seems to be coming to a close). The incident started on November 7 of that year with a Twitter post by David Heinemeier Hansson, a Danish software developer who created Ruby on Rails. Hansson “vented on Twitter” that even though he and his wife filed joint tax returns, lived in a community-property state, and had been married for a long time, Apple denied her application for a credit line increase. Hansson referred to Apple Card as “a **** sexist program” (expletive elided). Similar complaints by others followed quickly, including one by Apple cofounder Steve Wozniak. Two days later, by November 9, New York State regulators were already on it. Hansson tweeted that it was “wonderful to see regulators who are on the ball” and claimed that “My thread is full of accounts from women who’ve been declared to be worse credit risks than their husbands, despite higher credit scores or incomes.”
Eventually, the New York State DFS (Department of Financial Services) carried out an in-depth investigation of the algorithm used by Goldman Sachs to determine credit limits, including a thorough review of thousands of pages of records, interviews with witnesses and consumers, statistical analysis of the underwriting data used by Goldman Sachs, and a regression analysis of 400,000 Apple Card applications from New York applicants. The results ultimately exonerated Goldman Sachs and Apple by showing that the algorithm in question “did not consider prohibited characteristics of applicants and would not produce disparate impacts.” In other words, there was no discrimination on the basis of sex—neither intentional discrimination (“disparate treatment”) nor unintentional discrimination (“disparate impact”).3 Moreover, in the course of the investigation,
the Department learned it is a common misconception that spouses are entitled to equal credit terms from credit card issuers if they “shared finances.” Like the Consumer on Twitter [Hansson], they expressed the belief that they should have received the same Apple Card offers as their spouses because they shared bank accounts and other assets. [p. 10, emphasis mine]
Contrary to popular belief, however, individual credit scores and credit histories often lead to different outcomes, and for good reasons.4
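For readers who want a concrete sense of what such an analysis involves, here is a minimal sketch of a regression-based disparate impact check of the general kind a regulator might run on underwriting records. The synthetic data, column names, and model choice below are my own illustrative assumptions; they are not the DFS’s actual data or methodology.

```python
# Minimal sketch of a regression-based disparate impact check.
# The synthetic data stands in for real underwriting records; nothing here
# reflects the DFS's actual data or methodology.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50_000
df = pd.DataFrame({
    "credit_score":   rng.normal(700, 60, n),
    "income_k":       rng.lognormal(4.3, 0.5, n),   # annual income, in $1000s
    "debt_to_income": rng.uniform(0.05, 0.6, n),
    "is_female":      rng.integers(0, 2, n),
})
# Toy approval rule that depends only on legitimate underwriting inputs.
score = 0.02 * (df["credit_score"] - 680) - 4 * (df["debt_to_income"] - 0.3)
df["approved"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-score))).astype(int)

controls = ["credit_score", "income_k", "debt_to_income"]
X = sm.add_constant(df[controls + ["is_female"]])
model = sm.Logit(df["approved"], X).fit(disp=False)

# If, after controlling for the legitimate underwriting inputs, the coefficient
# on `is_female` is statistically and practically indistinguishable from zero,
# that is evidence against sex-based disparate treatment or disparate impact.
print("is_female coefficient:", model.params["is_female"],
      "p-value:", model.pvalues["is_female"])
```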
Granted, this was one isolated incident (if one of epic proportions). The second article in this series will take much more comprehensive stock of statistical decision making in the credit industry in general, including auto loans and mortgages. That discussion will show that the use of algorithms tends to improve transparency and decrease bias. A 2019 Berkeley study, for example, showed that face-to-face lending by humans discriminates significantly more than algorithms. This is consistent with earlier work on the subject, such as this 2002 article by economists who found evidence that algorithmic underwriting “provides substantial benefits to consumers, particularly those at the margin of the underwriting decision,” that such systems “more accurately predict default than manual underwriters do,” and that “this increased accuracy results in higher borrower approval rates, especially for underserved applicants.”
This is a key point I will be stressing repeatedly: Whenever claims are made about an algorithm exhibiting bias or discriminating against a protected group, and those claims are used as grounds for regulatory intervention, the first question that ought to be asked is this: Compared to what? The alternative to algorithmic decision making is almost always human decision making.5 Do we have good reason to expect that humans would be less biased in their decision making? As will be seen shortly, there is plenty of empirical evidence in support of a negative answer. Perfection should not be the enemy of the good. Algorithms don’t need to be platonic paragons of fairness—that’s an impossible standard that no decision-making mechanism of any kind can meet. Algorithms only need to be better than the alternative, humans, and that’s not a high bar to clear. This doesn’t mean that questions of algorithmic fairness should not be investigated, with a view to mitigating remaining biases to whatever extent is feasible, but it does undercut the position that algorithmic decision making (whether by AI systems or other types of algorithms) ought to be subject to stricter regulation.
Let's now turn to algorithmic workplace discrimination as our second brief example. To the best of my knowledge, there has never been a single documented case of such discrimination. Much has been made of Amazon’s gender-biased recruiting tool. Virtually every single academic paper and online discussion of algorithmic fairness mentions it, along with the COMPAS algorithm, as a prime exhibit of the sort of algorithmic bias that demands prompt regulatory oversight. What tends to get left out is that this algorithm was an experimental effort that was never actually used by Amazon recruiters. It was tested and then abandoned. And it was abandoned largely because it was highly inaccurate, above and beyond exhibiting gender bias. Its results were basically noise, “with the technology returning results almost at random.”
In the technology world anyway, I can tell you from extensive first-hand experience as a hiring manager, both in big tech companies and in startups, and also from knowing what happens in other companies in the industry, that the role of AI in recruiting remains practically nil. Hiring and promotion decisions are made by humans, who often agonize over them and discuss them extensively with multiple colleagues over days or weeks. Even screening is done manually, by a blend of recruiters and hiring managers. There is a lot of software out there that helps companies to track candidates, known as ATSs (Applicant Tracking Systems), and these are indeed widely used, but that is very different from unilaterally throwing out resumes. The latter is rarely done, at least in the technology sector,6 because recruiters and hiring managers do not want to take the chance of missing out on a strong candidate, and because software wouldn’t be able to do it reliably enough. A lot of systems score resumes against job descriptions and rank them accordingly, almost always via naive keyword-matching methods, but nothing is thrown out, and in practice the scores tend to be ignored by the recruiters.7 Claims about ATSs automatically weeding out applicants are usually fictional, typically made by people peddling books or seminars about “how to beat the ATSs”.
And yet there is a very sizable technical literature on algorithmic employment discrimination, with an even larger body of conversation about it in the mass media. When that content is parsed carefully for empirical evidence of actual algorithmic employment discrimination, the results are underwhelming, for a simple reason: Algorithmic employment discrimination presupposes algorithmic employment decisions, and decision making in the workplace is not algorithmic to any meaningful degree. I will argue that in most other areas of concern as well, the volume and pitch of the technical and public discourse about the risks of AI, and the corresponding perceived need for regulation, are completely out of line with the realities on the ground.
The more general question I would like to explore is this: Why is it that the issue of AI regulation and algorithmic governance in general has aroused such fury and passion, even though, as I hope to show, the actual impact of AI on people’s lives has been relatively limited, while, by comparison, the issue of malfunctioning software has hardly registered a blip on the public radar, even though it has exacted a much greater toll?
There are many plausible answers to this question, and I will later consider a number of them in detail. For example, it might be natural to suppose that software correctness is a highly technical issue, unlike the overtly ethical considerations surrounding AI regulation. But I don’t find that explanation cogent. Imagine that high-rise buildings were collapsing across the U.S. with alarming frequency. Ultimately, of course, the underlying issues would be highly technical: matters of structural integrity, design standards, materials, maintenance, and so on. However, these issues would inevitably have moral, ethical, political, and regulatory dimensions as well, not unlike those that arise in AI, because the collapse of the buildings would constitute a breach of the social contract and the imperative to protect human life and public safety. There would be accountability questions to be answered, about the duty of care that builders, engineers, inspectors, and policymakers owe to the general public. There would be questions of justice, fairness, and restitution to the victims. We could expect the existing regulatory framework (involving codes, inspection processes, and enforcement of safety standards) to come under scrutiny. And we would naturally expect intense media coverage and a public outcry until there was adequate policy reform.8 In addition, of course, AI itself is highly technical; many of the policy issues that arise in connection with it are intertwined with complex questions about digital systems.
So whatever the right explanation of the disparity might be, it has to involve other factors. At any rate, my main objective will not be to provide such an explanation for its own sake. I am posing this question at the outset only because it will serve as a useful analytical and rhetorical device for framing the discussion, and as a critical lens through which to examine a number of issues in AI ethics and regulation.
A Preview of What Lies Ahead
This is the first in a series of long articles that will delve into a number of complex issues at the intersection of AI technology, ethics, law, and policy. A lot has already been written about these issues, and I will review a good deal of it along the way. Because the topics are complex and controversial, these articles will inevitably raise more questions than they will answer. My primary goal is simply to challenge some assumptions, and indeed to raise some questions. That said, since I am not approaching these issues from an agnostic stance, let me lay a few cards on the table to give you a flavor of what is coming. Here are the main viewpoints I will be defending:
Most calls for AI regulation are well-intentioned but deeply confused and stand to do more harm than good. Regulation should be vertical and focused on specific problems, not horizontal and targeting generic technologies. Even in narrowly focused contexts, regulatory interventions should only be undertaken after careful analysis of pros and cons and deep reflection concerning possible unintended consequences.
In particular, and contrary to the urging of several policy makers and “thought leaders” (especially in Europe), AI regulation should not be enacted on the grounds of the precautionary principle, a muddled policy guideline that is generally problematic but particularly unsuitable for regulating AI, for reasons that will be discussed in detail. There is only a limited range of circumstances, roughly coinciding with the conditions of Rawls’s original position, in which a watered down version of the precautionary principle could be of relevance to AI, and then only in connection with existential risks.
To clarify, I will not be claiming that AI doesn’t present any regulatory challenges. (For example, I will be arguing that some applications of facial recognition technology need to be severely constrained, while others should be altogether banned.) Rather, I will be suggesting that these challenges have been vastly exaggerated, often due to misunderstandings about the underlying technology; that they arise in very specific contexts and ought to be addressed in those narrow contexts; and that the principles that have been invoked to justify horizontal regulatory interventions, most notably variants of the precautionary principle, are profoundly problematic.
Existential fears about AI are unfounded and fail to meet any plausible de minimis risk requirement under any reasonable interpretation of the precautionary principle, even under the aforementioned limited circumstances. On a related note, arguments for the so-called AI “singularity” are deeply flawed (to the extent that they are coherent).
Algorithms can improve both the effectiveness and the fairness of important decision-making processes. There is a large amount of evidence, amassed over more than 70 years, demonstrating that algorithms consistently dominate human performance when making decisions under uncertainty. And because we can run algorithms on computers at will, feeding them any inputs we like and obtaining the outputs instantly, consistently, and at scale, their behavior can be scrutinized to a degree that is impossible to match with human decision making. That by itself makes them incomparably more transparent, even when the algorithm is a “black box.” (A brief sketch of what such scrutiny can look like in practice appears right after this list.) We will be squandering an important opportunity to improve the status quo if we keep demonizing algorithms or making excessive regulatory demands for properties (such as “explainability”) that in most cases are neither technically possible nor necessary.
Of course, algorithms are not a magic bullet. Most of the underlying decision problems are inherently difficult because they require making predictions about future behavior on the basis of highly imperfect—and often subtly biased—information about past behavior, condensed into a few numbers. For that reason, mistakes and injustices will always be made. That, unfortunately, is inevitable. The claim is only that algorithms can be more accurate and less biased than humans.
Beyond a certain point, formal definitions of fairness are otiose. Neither group-based nor individual-based definitions can adequately capture pretheoretic intuitions about justice. Serious problems with Aristotle’s LCM (the “like-cases maxim”), which underpins all individual-based definitions of fairness, have been known for a very long time. And it is not difficult to give counterexamples to any of the more recent group-oriented definitions of fairness.9 Neither can yield substantive social justice across the board, a notion that is inextricably bound up with a tremendously complex array of interlocked socioeconomic and ethical problems that stretch back hundreds of years and remain unresolved. These problems can only be addressed by democratically mandated policymaking and jurisprudence, not by computer scientists moonlighting as philosopher kings. Computer scientists can inform the discussion by showing what is algorithmically feasible and what might be mathematically impossible. Moreover, formal analyses can elucidate ideas, sharpen our thinking, and introduce useful conceptual distinctions. And while the technical AI community has made invaluable contributions on that front, we have long passed the point of diminishing returns. The fundamental fairness problems in decision making under uncertainty, such as affirmative action and structural inequality, are political, legal, and philosophical, not algorithmic. In fact, as legal scholars have recently started to point out, and as will be explained in detail later, many of the bias-mitigation proposals that computer scientists have put forth, seemingly as the spirit has moved them, stand a good chance of being found illegal by the courts, as they subject individuals to facially disparate treatment.
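To make the point about scrutiny a bit more concrete: because a model can be queried at will, an auditor can feed it matched pairs of inputs that differ only in a protected attribute and measure how much its outputs move, at whatever scale is needed, without ever opening the black box. The sketch below is a toy illustration of that idea; the scoring function and field names are stand-ins I made up, not any real vendor’s system.

```python
# Toy illustration of black-box scrutiny via counterfactual probing: score each
# applicant twice, changing only a protected attribute, and summarize the gaps.
# `score_applicant` is a made-up stand-in for whatever opaque model is audited.
import random
import statistics

def score_applicant(applicant: dict) -> float:
    # Toy black box; this one happens to ignore gender, but an auditor would
    # not assume that in advance.
    return (0.6 * applicant["credit_score"] / 850
            + 0.4 * min(applicant["income"] / 200_000, 1.0))

def counterfactual_flip_audit(applicants, attribute, values):
    """Flip `attribute` for each applicant and summarize how the scores change."""
    deltas = []
    for a in applicants:
        flipped = dict(a)
        flipped[attribute] = values[1] if a[attribute] == values[0] else values[0]
        deltas.append(score_applicant(flipped) - score_applicant(a))
    return {
        "mean_delta": statistics.mean(deltas),
        "max_abs_delta": max(abs(d) for d in deltas),
        "share_affected": sum(abs(d) > 1e-9 for d in deltas) / len(deltas),
    }

random.seed(0)
pool = [{"gender": random.choice(["F", "M"]),
         "credit_score": random.randint(550, 820),
         "income": random.randint(30_000, 250_000)} for _ in range(10_000)]
print(counterfactual_flip_audit(pool, attribute="gender", values=("F", "M")))
```

No human decision maker can be re-run ten thousand times on systematically varied inputs; that asymmetry is what the transparency claim above rests on.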
These points are not particularly novel—many others have expressed similar positions. But I hope to make some novel arguments for them, suggest responses to some recent rejoinders that appear to have gone unanswered, and hopefully make a few insightful observations along the way.
Here is a brief roadmap: I will start by charting the stratospheric rise of the field of AI ethics, along with related regulatory concerns and activity; present some initial grounds for skepticism; and highlight the expansionist ethos of the field, while drawing some critical parallels with human rights inflation. I will then start to address in a more focused way some of the specific concerns listed earlier, involving algorithmic decision making in the workplace, in the credit industry, and in judicial settings, as well as issues with facial recognition technology. This initial discussion will start to challenge the dominant narrative about the seriousness of the risks posed by algorithmic decision making.
A subsequent installment will revisit the opening question posed in the introduction and start considering possible answers to it. I will argue that (a) such answers inevitably make a case for what might be called AI exceptionalism, and that (b) the case is weak: It is neither a good explanation of the discrepancy described in the introduction, nor a compelling justification for AI regulation. I will also suggest that such answers ultimately derive most of their resonance from AI’s alleged existential risks, even when they explicitly dismiss those and focus instead on short-term risks. That will take us to a critical appraisal of singularity arguments and an extensive discussion of the precautionary principle and its historical roots in environmental regulation (from which AI stands to learn a lot, even though there are important differences between the two settings). Yet another part will be a detailed discussion of algorithmic fairness; this will be the most technical part of the series. The penultimate piece will discuss legal and philosophical issues in algorithmic decision making. The final essay will be a commentary on the current discourse about AI and so-called “disinformation.”
“Everybody is getting in on this”
At the 2023 annual conference of the Society for Human Resource Management, HR attorney Kelly Dobbs Bunting opined: “The robots win. We all die.” She was using the dramatic series ending of Westworld “as a warning bell for the dangers of artificial intelligence in the workplace.” She added that “all of the new AI tools coming into play are both kind of cool and kind of horrifying” and that “there is a huge tsunami coming of state regulation.” Speaking of the legislation that is being passed, such as New York City’s new Local Law 144, which requires annual bias audits for employers who use algorithms to assist workplace decision making (e.g., in resume screening), she remarked: “Everybody’s getting in on this.”
She is certainly right about that last part. In his 2020 book AI Ethics, Mark Coeckelbergh writes that “The widely shared intuition that there is an urgency and importance in dealing with the ethical and societal challenges raised by AI has led to an avalanche of initiatives and policy documents” (p. 148). The calls for regulation are coming from a broad coalition of groups and individuals, with diverse motives and rationales: advisory bodies; professional organizations such as the ACM and IEEE; governmental programs such as the Lords Select Committee, the European Commission’s “High-Level Expert Group on AI”, the White House Office of Science and Technology Policy, and the Advisory Council on the Ethical Use of Artificial Intelligence and Data in Singapore; NGOs; academic initiatives; human rights organizations; non-profits; private individuals, from artists and authors to scientists and business executives; industry; and now, of course, politicians.
Collectively, this extraordinary level of advocacy has exerted a tremendous amount of pressure, playing an instrumental role in the introduction of an exceedingly wide range of both hard and soft laws designed to regulate AI. The term “hard law” refers to regulations implemented as formal, binding legislation. As Gutierrez and Marchant point out, “these processes can entail significant time delays and resources, which limits their responsiveness to emerging issues. On the other hand, there is soft law, which exists in the form of programs that set substantive expectations, but are not directly enforceable by government. Governance of this type can exist without jurisdictions and be developed, amended, and adopted by any entity. Throughout time, soft law has been treated as a preferred approach or delegated as a temporary alternative until hard law is promulgated.”
Already in 2018, “academics and regulators alike” were “scrambling to keep up with the number of articles, principles, regulatory measures and technical standards produced on AI governance.” The above report by Gutierrez and Marchant identified a mind-boggling “634 soft law AI programs”, over 90% of which were established between 2016 and 2019. Many more have seen the light of day since then. About 80% of these 634 programs involved the formulation of “principles,” of which there were already no fewer than 158. For a quick flavor, we have DeepMind’s “Ethics & Society Principles”; Australia’s “AI Ethics Principles”; Canada’s guidelines for “Responsible Use of AI”; the Beijing “AI Principles”; the “Responsible Machine Learning Principles” from “The Institute for Ethical AI & ML”; “Principles for the Governance of AI” by The Future Society; IEEE’s “Ethically Aligned Design” principles; “AI Guiding Principles” from AT&T and “AI Ethics” from Salesforce; “AI Ethics Guidelines” from Finland’s Tieto; “AI Principles” from Spain’s Telefonica; “AI Ethics Guidelines” from Japan’s Sony Group; “Guidelines for Artificial Intelligence” from Deutsche Telekom; “AI Principles” from the Future of Life Institute; and so forth. We have the “Montreal Declaration for Responsible AI” (not to be outdone, Toronto published its own manifesto, “Protecting the right to equality and non-discrimination in machine learning systems”); AI For People (not to be confused with AI For The People); we have the “Partnership on AI” among corporations including Amazon, Apple, Meta, and DeepMind; we have “trustworthy AI”, “Ethical AI”, Microsoft’s “AI For Good”, along with a dizzying number of other similar programs bearing equally lofty designations. And while I could not hope to list them all, I would be remiss if I neglected to mention “the Vatican AI Principles” (also known as the “Rome Call for AI Ethics”). Yes, even the Pope “got in on” the AI action.
By way of hard law, there has been a regulatory deluge of 407 AI-related bills across more than 40 states. This January (2024) alone, states introduced a total of 211 AI bills, at an unprecedented rate of roughly 50 new AI-related bills per week, marking an increase of 507% over the previous year, which itself saw “thousands of AI policy pronouncements, proposals, laws, orders and regulations, as well as an avalanche of headlines, talking heads, hearings, conferences, editorials and scholarly publications” [my emphasis]. The regulatory “juggernaut” that is underway has “no precedent for its breadth, depth and speed,” with “thousands of regulatory and judicial proceedings … emerging in hundreds of jurisdictions, making uneven AI regulation almost certain.” Indeed, a 2019 review of 84 sets of ethical AI guidelines found that they had an empty intersection—there was not a single principle featured in all of them. A few general themes, such as transparency and justice, did form a vague recurring motif, in that they appeared in more than half of the 84 documents (see Table 2 of the cited paper), but they had widely diverging interpretations and led to very different recommendations.
Such top-down approaches to AI methodology and regulation, aimed at formulating abstract, decontextualized, and universally applicable principles, are at best well-intentioned Cartesian exercises in futility. Their impact on everyday practice will be about the same as the impact that the ten commandments of Moses have on the ethics of gene editing. At most, they might contribute a loose framework of reference points for ethical exploration and debate, but without any ability to provide tangible directions. Different AI applications across different domains pose highly context-specific and nuanced challenges that cannot be meaningfully addressed by horizontal, overly broad, one-size-fits-all bromides. I hope to show that the preoccupation with enunciating grand talismanic “AI principles” is fostering a culture of regulatory overreach and is leading to coarse and overarching measures that are not only difficult to implement and enforce, but can cause more harm than good.
The Industrialization and Inflation of AI Ethics
Less charitable interpretations are apt to dismiss many of the foregoing initiatives as PR campaigns in disguise, as image management and branding stunts. Worse, blunt declarations like “Everybody’s getting in on this,” the sort of unguarded observation that is typically made at a conference only when the day is over and the participants retreat for drinks and gossip, and depictions of the frenetic engagement with “AI Ethics” as a “gold rush” that is producing a “veritable AI ethics industry” rivaling or surpassing the much older ethics industry in the life sciences, signal what many suspect: that AI regulation is a bandwagon that everyone is jumping onto in a hasty scramble for fear of missing out. They hint that these initiatives are being pursued more for the benefits associated with being at the forefront of a trendy issue than from a genuine commitment to addressing the underlying challenges; and that the championing of AI regulation might serve as a strategic move, a way to align oneself with what appears to be a prevailing wind in the sociocultural landscape. So it should not come as a surprise that much of this activity has been deservedly met with skepticism, and occasionally with amusing cynicism, even by scholars who work squarely in the field of AI Ethics, some of whom have criticized many of these AI principles as “vacuous material that is little more than ‘virtue signaling’ … with empty displays of ethical probity,” such as clarion calls for AI to “benefit humanity” and pronouncements that are “so broad as to veer toward the meaningless” (see Paula Boddington’s incisive essay on “Normative Modes” in the Oxford Handbook of AI Ethics, p. 129).10
These critiques have done little to stem the tide of ethical AI principles, which are continuing to grow not only in volume but also in scope, voraciously broadening their territorial reach by encompassing a much wider and more diverse set of mandates, with an expansionist fervor that evokes something of a manifest regulatory destiny. Indeed, many papers are framing such expansionism as an imperative strategic direction, as in “A Call to Expand the AI Ethics Toolkit”, which submits that AI’s “ethical attention” to date has been myopic, having only “looked at a fairly narrow range of questions about expanding the access to, fairness of, and accountability for existing tools” and that scholars must instead “develop much broader questions of [sic] about the reconfiguration of societal power, for which AI technologies form a crucial component” by “using approaches from feminist theory, organization studies, and science and technology.” In that vein, the second principle of the Montreal Declaration requires that “the development and use of artificial intelligence systems (AIS) must permit the growth of the well-being of all sentient beings,” a statement that is noteworthy in that it quantifies not over human beings but over all sentient beings—a nod to thinkers like Peter Singer, who have long argued that the very idea of human rights is speciesist and should be abandoned in favor of a more general conception of animal rights.
This is made explicit in other papers, which, for example, argue that “the field of artificial intelligence (AI) ethics needs to give more attention to the values and interests of nonhumans, such as other biological species and the AI itself,” bemoaning the fact that “the field of AI ethics gives limited and inconsistent attention to nonhumans” [my italics]. Other authors contend that the most important direction that AI Ethics must pursue is environmentalism, arguing, for example, that “a new approach to AI ethics is needed and that ‘sustainability’, properly construed, can provide that new approach,” an approach that would mark a “paradigm shift” in the policy space of AI Ethics, breaking with current practices that “woefully overlook” the technology’s ecological impacts on “vulnerable and marginalized communities” [my italics]. Along those lines, one of the OECD’s AI principles demands that stakeholders must “proactively engage in responsible stewardship of trustworthy AI in pursuit of beneficial outcomes for people and the planet,” such as “protecting natural environments.”
A chapter in the 2024 Handbook on the Ethics of Artificial Intelligence posits that there is “a pressing need to ‘queer’ the ethics of AI” [my italics] in order “to challenge and reevaluate the normative suppositions and values that underlie AI systems,” such as “binarism” and “dualism”, ways of thinking that have “impregnated many constructions throughout history,” including engineering, which displays “a general inclination for using conceptual dichotomies, like classified/unclassified, yes/no, male/female, concrete/abstract, and reductionist/holistic”—“false dichotomies” that “often have negative consequences when applied to AI,” e.g., if “an AI system is trained to classify people as either male or female, the algorithm may not accurately recognize or include people who identify as non-binary or gender non-conforming” and such “misgendering can have various consequences,” including “receiving [social media] adverts geared to a different audience”. Echoing that sentiment, other authors report that facial analysis technologies are “universally unable to classify non-binary genders,” a failure that “privileges cisnormativity,” while others argue that binarism “collapses gender” and “reduces the potential for AI to reflect gender fluidity and self-held gender identity.”
Yet others believe that feminist theory is the key ingredient for AI Ethics and decry “big data projects that are characterized by masculinist, totalizing fantasies of world domination as enacted through data capture and analysis,” and which “ignore context, fetishize size, and inflate their technical and scientific capabilities” (from Data Feminism, p. 151), while other authors discussing “feminist approaches to who’s responsible for ethical AI” make a number of “recommendations based on feminist theory,” advising, for example, that “organisations should move from a static model of responsibility to a dynamic and ethically motivated response-ability.” Likewise, another paper in the 2022 AAAI/ACM Conference on AI, Ethics, and Society warns that “feminist philosophers have criticized the local scope of the paradigm of distributive justice and have proposed corrective amendments to surmount its limitations” and purports to show that “algorithmic fairness does not accommodate structural injustices in its current scope,” a problem which “requires widening the scope of algorithmic fairness to include power relations, social dynamics and actors and structures which are among the main sources of the emergence and persistence of social injustices relevant to algorithmic systems” [my italics]. And following an initial flurry of research activity in the ML community on group-based fairness metrics, which we will review thoroughly later on, increasing amounts of attention are being devoted to “the intersectional harms that result from interacting systems of oppression.” The 2023 paper “Factoring the Matrix of Domination: A Critical Review and Reimagination of Intersectionality in AI Fairness” reviews no fewer than 30 papers from the AI Fairness literature that discuss intersectionality, which they define as “a critical framework that, through inquiry and praxis, allows us to examine how social inequalities persist through domains of structure and discipline,” and argue that “adopting intersectionality as an analytical framework is pivotal to effectively operationalizing fairness” in AI [my italics].
These examples are only meant to illustrate the aforementioned expansionism and its associated rhetorical excesses. They are not meant to question the legitimacy of the cited concerns, only the couching of the need to address these concerns as a moral imperative for AI Ethics. Regardless of what one might think of intersectionality in the context of critical social theory, for example, there are intriguing theoretical questions to be asked about the interplay between group-based notions of algorithmic fairness and intersections of protected groups, and interesting results have been obtained there (although the interpretation of such results and their broader implications for AI ethics remain open to debate). And it is of course true that AGR (automatic gender recognition) technology is premised on the assumption that gender can be determined on the basis of phenotypical characteristics, which is indeed fundamentally incompatible with the basic tenet of trans identity (which is essentially the negation of that premise), although it must be kept in mind that AGR is only one part of the landscape of facial recognition technology.
In some ways, the proliferation of AI Ethics principles mirrors human rights expansionism—unsurprisingly, perhaps, given the intimate connection between ethics and human rights. There has been a persistent debate in the human rights literature for many decades now over a number of key questions, such as: What exactly are human rights? How many of them are there, or how many should be officially acknowledged? Are they all on equal footing? The UDHR (Universal Declaration of Human Rights), adopted by the UN in 1948 (drafted by a commission chaired by Eleanor Roosevelt), strongly influenced by the then-recent experience of World War II, introduced 30 articles enunciating “basic rights and fundamental freedoms,” some of which are inherently negative (such as the right not to be enslaved or tortured) and others, such as the right to life, that were commonly understood to be negative when they first started out their career but then gradually became increasingly positive.11
But even from that early point there was controversy revolving around socioeconomic rights (also known as “welfare rights” or “second-generation rights”12), such as the controversy around the infamous Article 24, which lays it down that paid holidays are a human right,13 or Article 25, which stipulates that “Everyone has the right to a standard of living adequate for the health and well-being of himself and of his family, including food, clothing, housing and medical care and necessary social services, and the right to security in the event of unemployment, sickness, disability, widowhood, old age or other lack of livelihood in circumstances beyond his control.” As the years went by, the set of human rights or proposed human rights kept expanding, e.g., the human right to health came to include the “right to prevention, treatment and control of diseases” and the opportunity to enjoy not mere health but “the highest attainable level of health,” requirements that entail routine massive human rights violations across the globe, in developed and developing countries alike. Others have proposed that children have a human right to be loved; that there is a right to Internet access; that people have the right to sell their body organs and to have access to assisted suicide. There have been calls “to assert a novel right to a human decision”— because “machine-learning tools are perceived to be eclipsing, even extinguishing, human agency,” an idea that has been “embraced” by “European law” such as the GDPR.14 And, naturally, there have been calls to “transition” from human rights to “a biocentric approach” focused on the “rights of nature.”
In his landmark book On Human Rights, James Griffin vividly illustrates the historical trajectory of human rights by sketching the journey of one particular right, the “right to life,” as follows (p. 97):
In the seventeenth century most of the proponents of a right to life seem to have conceived of it negatively—as a right not to be deprived of life without due process. But since then, the supposed content of the right has broadened, and lately has positively ballooned: from a right against the arbitrary termination of life, to a right to rescue, to a right to protection of anything deemed to be covered by the term “sanctity of life”, including a right against the prevention of life (so against euthanasia, abortion, sterilization, etc.—a use of the right made by many “pro-life” campaigners), to a right to a fairly modest basic welfare provision, all the way up to a right to a fully flourishing life. And that last extension clearly goes too far.
This is what I meant earlier about a right starting out its career firmly in the negative camp but then gradually becoming positive. That is, of course, in addition to the slew of brand new human rights that have been claimed, a small sample of which were listed above. Indeed, it seems that everything that might make for a better life can be and increasingly is couched in the language of human rights, and it is not clear where the line might ever be drawn.15 The concern is that inflation renders talk about human rights vacuous and devalues the “currency” of the concept. It also makes human rights de facto unenforceable, reducing them to little more than wish lists. The situation is similar in AI Ethics, where much of the literature commits the fallacy of confusing a worthy cause with a positive right that ought to impose moral or legal obligations.
Meehl Patterns and AI Concerns vs. AI Impact
Algorithmic Employment
As discussed earlier, decision making in the workplace remains manual, and hence talk of algorithmic employment discrimination is speculative (even though, based on the coverage that the subject has received, one would be excused for believing it to be rampant in practice). But let’s assume, for the sake of argument, that we live in another universe in which recruiting, for example, has been largely automated, and that it is now routine for resumes to be rejected or approved by software. Would that justify the introduction of regulation to ensure that the software in question is not biased? What purpose would such regulation serve?
Many countries, including the U.S., already have legal frameworks in place to protect people against discriminatory workplace practices. Anyone who believes they were discriminated against on the basis of a protected attribute (race, gender, ethnic origin, sexual orientation, religious affiliation, and so on) already has recourse to governmental channels, such as the U.S. Equal Employment Opportunity Commission (EEOC), that can be used to lodge official complaints. In 2022, the EEOC received tens of thousands of charges (73,485, to be exact) alleging workplace discrimination. About 23% of those involved discriminatory hiring practices. Crucially, EEOC regulation is largely reactive, not proactive. The agency investigates complaints but does not routinely perform audits.
It is not clear why proactive regulation is needed to specifically scrutinize the use of algorithms or AI models in workplace decision making. New York City recently became the first jurisdiction to introduce such legislation, Local Law 144 (passed in 2021 and augmented with a number of rules last year), whose main upshot is threefold: Companies using AI software for hiring purposes (under a very loose definition of the term “AI”) are required to notify candidates that they are being evaluated by an automated system; the candidates have a right to know what data attributes are being collected and used to evaluate them; and most importantly, the software must be subjected to annual bias audits, conducted by “independent auditors.”
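Roughly speaking, the quantitative core of such an audit comes down to comparing selection rates across demographic categories: for each category, one computes the ratio of its selection rate to that of the most-selected category. Here is a rough sketch with made-up data; the column names and the 0.8 reference line (the EEOC’s familiar “four-fifths rule”) are my own illustrative choices, not the law’s prescribed methodology.

```python
# Rough sketch of the impact-ratio computation a bias audit involves.
# The data, column names, and 0.8 threshold are illustrative assumptions.
import pandas as pd

def impact_ratios(df: pd.DataFrame, group_col: str, selected_col: str) -> pd.DataFrame:
    rates = df.groupby(group_col)[selected_col].mean().rename("selection_rate")
    out = rates.to_frame()
    # Impact ratio: each group's selection rate relative to the most-selected group.
    out["impact_ratio"] = out["selection_rate"] / out["selection_rate"].max()
    out["below_four_fifths"] = out["impact_ratio"] < 0.8
    return out.sort_values("impact_ratio")

# Hypothetical audit data: one row per candidate screened by the tool.
audit = pd.DataFrame({
    "sex":      ["F", "F", "M", "M", "M", "F", "M", "F", "M", "F"],
    "selected": [  1,   0,   1,   1,   0,   1,   1,   0,   1,   0],
})
print(impact_ratios(audit, group_col="sex", selected_col="selected"))
```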
But if the ultimate objective is to ensure that a company’s hiring practices are not discriminatory, then why not proactively regulate those hiring practices regardless of whether they involve AI? Is there any evidence to suggest that humans are less biased than automated systems, and hence not in need of proactive auditing? In the absence of such evidence, why is only one decision-making mechanism being singled out for proactive regulation? On its face, this disparity betrays a misguided focus on how a certain result is generated (on the nature of the underlying decision-making process) rather than on what that result actually is (on the ultimate outcome of that process), i.e., on procedural rather than substantive issues.
What evidence does exist for bias in workplace decisions unambiguously points to humans as the source, not to algorithms. Human bias has yet to be eliminated from the hiring process, despite U.S. federal law that protects equal employment opportunity (Title VII of the 1964 Civil Rights Act), and despite the fact that, for decades now, HR departments across countless companies have introduced guidelines to ensure that hiring practices adhere to these regulations. There is an extensive body of research documenting a large array of biases exhibited by human recruiters. Perhaps the best-known representative of that line of work is the 2003 NBER16 paper “Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination,” which found that “white names receive 50 percent more callbacks for interviews” and that “the amount of discrimination is uniform across occupations and industries. Federal contractors and employers who list ‘Equal Opportunity Employer’ in their ad discriminate as much as other employers.” Human recruiters also discriminate on the basis of age, gender, and other protected characteristics, such as religion.17
One might think that the “Emily and Greg” paper is 20 years old at this point, and that surely things have improved by now. Unfortunately, that’s not the case. A 2017 meta-analysis of all correspondence studies on hiring discrimination that took place between 2005 and 2016, 19 of which were conducted in the U.S., found that “the overwhelming majority of the studies report negative treatment effects (i.e. discrimination of the group hypothesized to be discriminated against).” And newer studies have confirmed persistent hiring discrimination in Western countries. For instance, a 2021 paper in Nature studied “the online recruitment platform of the Swiss public employment service” and found “that rates of contact by recruiters are 4–19% lower for individuals from immigrant and minority ethnic groups, depending on their country of origin, than for citizens from the majority group. Women experience a penalty of 7% in professions that are dominated by men, and the opposite pattern emerges for men in professions that are dominated by women.” Another 2021 study “drawing on a field experimental data from five European countries” analyzed “the responses of employers (N = 13,162) to applications from fictitious candidates of different origin” and found “ethnic discrimination and a female premium.” And an even more recent and comprehensive 2023 PNAS study examined racial and ethnic discrimination trends in hiring by analyzing results from 90 field experiments, ranging from 1969 until 2019 and involving a total of 174,000 job applications across six Western countries (Canada, France, Germany, Great Britain, the Netherlands, and the United States). They found that “levels of discrimination in callbacks have remained either unchanged or slightly increased overall for most countries and origin categories.”
Studies that explicitly compare human bias in processes like hiring against algorithmic bias are few, largely because, as pointed out earlier, algorithms are not widely used in practice, so available data is scarce. However, the few studies that exist, and which tend to rely on ad hoc algorithms trained for research purposes, consistently give the advantage to algorithms, even when these are trained on biased historical data. See, for instance, this 2020 paper, which concludes that
the algorithm increases hiring of non-traditional candidates. In addition, the productivity benefits come from these candidates. This includes women, racial minorities, candidates without a job referral, graduates from non-elite colleges, candidates with no prior work experience, candidates who did not work for competitors.
The author attributes the model’s reduced bias to noisy data, which is quite common, and writes that the results make
optimistic predictions about the impact of machine learning on bias, even without extensive adjustments for fairness. Prior research cited throughout this paper suggests that noise and bias are abundant in human decision-making, and thus ripe for learning and debiasing through the theoretical mechanisms in the model.
Other research has shown that algorithms can help to reduce gender bias in hiring. These results are consistent with work in other areas. In the judicial setting, for example, Kleinberg et al. studied the problem of deciding whether to grant a defendant bail and showed that algorithms can deliver significant social benefits:
Even accounting for these concerns [concerns about biased data], our results suggest potentially large welfare gains: a policy simulation shows crime can be reduced by up to 24.8% with no change in jailing rates, or jail populations can be reduced by 42.0% with no increase in crime rates. Moreover, we see reductions in all categories of crime, including violent ones. Importantly, such gains can be had while also significantly reducing the percentage of African-Americans and Hispanics in jail. We find similar results in a national dataset as well.
Crucially, algorithms can not only reduce discrimination, but they typically reduce prediction errors as well: They tend to be more accurate than humans in most cases and at least as accurate in the remaining cases. (In technical terms, their performance dominates human decision making under uncertainty.) I will say a lot more about this in the section after the next, but intuitively, the key reason for both improvements (in accuracy and in bias) is the same: Algorithms shut out irrelevant, extraneous considerations, such as appearance, which tend to have a strong effect on human decision makers. Judges, for example, who pride themselves on their objectivity, have been shown to impose harsher criminal sentences on Black Americans with more prominent Afrocentric features, which the authors of the cited 2004 article define as “those physical features that are perceived as typical of African Americans (e.g., dark skin, wide nose, full lips).” Another article from 2017 surveyed “the empirical research that assesses whether judges live up to the standards of their profession” and reported that judges “rely heavily on intuitive reasoning in deciding cases, making them vulnerable to the use of mental shortcuts that can lead to mistakes. Furthermore, judges sometimes rely on facts outside the record and rule more favorably towards litigants who are more sympathetic or with whom they share demographic characteristics.” Evidence has repeatedly shown that judges impose longer prison sentences on Black Americans than on White Americans for similar crimes. Bail decisions, too, are known to be racially biased.18 Discussing loan decisions made by bank managers, Dutta points out that “A large part of human decision making is based on the first few seconds and how much [the decisionmakers] like the applicant. A well-dressed, well-groomed young individual has more chance than an unshaven, disheveled bloke of obtaining a loan from a human credit checker.” The same can be said about hiring processes that rely on human judgments, which are notoriously susceptible to a wide spectrum of biases, both cognitive and non-cognitive: Groupthink (aka the “bandwagon effect”), recency bias, the halo effect, the leniency effect, the similarity attraction bias, the contrast effect, overconfidence bias, the horn effect, confirmation biases, stereotypes, as well as unconscious biases (e.g., related to an applicant’s weight) can and do affect human decision making, not just in hiring but also in college admissions, in evaluating loan eligibility, and so on—anywhere humans make decisions, essentially. Large tech companies like Amazon require their interviewers to undergo extensive and rigorous training aimed at overcoming biases (I had to undergo that training), but I am not aware of any research on the effectiveness of such programs.19
Ultimately, many of the issues stem from the fact that the human brain has been molded over many thousands of years of evolution to depend on rapid, instinctive, quick-and-dirty “System 1” heuristics. As Bertrand and Duflo put it in their 2017 Field Experiments on Discrimination, p. 378:
Ambiguous or unfamiliar situations tend to be associated with System 1: without concrete criteria for decision-making, individuals will rely on the information that is most easily accessible, including stereotypes, to make decisions (Dovidio and Gaertner, 2000; Johnson et al., 1995). Emotional states such as anger or disgust have also been shown to induce more bias against minority group members, even if those emotions were not triggered by the minority group members themselves or directly related to the decision-making situation (DeSteno et al., 2004; Dasgupta et al., 2009). Interestingly, even happiness has been shown to produce more stereotypic judgments, though the exact mechanism for this is unclear (Bodenhausen et al., 1994). Importantly, states of fatigue, time pressure, heavy workload, stress, emergencies, or distraction also trigger more System 1 reasoning and more stereotypic judgments (Eells and Showalter, 1994; Hartley and Adams, 1974; Keinan, 1987; Van Knippenberg et al., 1999; Bodenhausen and Lichtenstein, 1987; Gilbert and Hixon, 1991; Sherman et al., 1998).
The use of algorithms blocks out not only the vast array of biases that come courtesy of having a human brain, biases that make decision makers susceptible to hasty judgments influenced by factors that ought to be irrelevant (including, incidentally, the decision maker’s ideology20); it also blocks out irrelevant factors that come courtesy of having a body—factors like low glucose levels, or not having gotten enough sleep.21 Algorithms don’t have stomachs or arteries; they don’t have motivational or emotional biases because they don’t have motivations or emotions; they don’t have cognitive biases because they don’t have any beliefs; and they are not physically situated, so they cannot be influenced by random environmental stimuli. Yes, some biases can still be indirectly inherited via the data used to train a model, but, as pointed out above, the resulting models tend to display less bias, even if the bias is not completely eliminated.
There has been much talk about algorithms amplifying biases that might exist in a training dataset, meaning (rather simplistically) that if a training dataset D shows a degree x of bias concerning a protected attribute A, then a model trained on D might exhibit a corresponding bias at a degree y > x, because the model might come to exaggerate the importance of A (or other features that are highly correlated with it) for inferring the value of the target random variable. This notion is somewhat nebulous, because, typically, when we talk about a process amplifying a variable, we are observing the same quantity either in two different states or in two different distributions; whereas here we are comparing a static property of a dataset D (the A-bias inherent in D) vs. the behavior of a model trained on D and evaluated on some test set (presumably from the same distribution that generated D), which are two fundamentally different types of things. Putting that aside, it is worth pointing out that the notion was introduced and has since been studied predominantly in the setting of visual recognition tasks such as image classification, captioning, and visual semantic role labeling, where the protected attribute is typically gender, not in the distributive-justice-like context of the decision problems concerning us here. The extent to which dataset biases might be amplified by models that tackle more conventional decision problems is unknown, as is the relationship between bias amplification in such settings (if any) and various types of fairness metrics. Recall that sufficient noise in the training data can largely remove biases (as argued by Cowgill), and note that standard visual recognition datasets like MS-COCO, which have been used to study bias amplification, have high-quality annotations and relatively simple and unambiguous classification problems, which means that the noise in such datasets is likely to be much lower than the noise of a training dataset for hiring or for criminal sentencing, where subjectivity and inconsistency are much more prevalent. (As the late Daniel Kahneman points out (pp. 224-225): “humans are incorrigibly inconsistent in making summary judgments of complex information. When asked to evaluate the same information twice, they frequently give different answers. The extent of the inconsistency is often a matter of real concern. … The widespread inconsistency is probably due to the extreme context dependency of System 1. We know from studies of priming that unnoticed stimuli in our environment have a substantial influence on our thoughts and actions. These influences fluctuate from moment to moment. The brief pleasure of a cool breeze on a hot day may make you slightly more positive and optimistic about whatever you are evaluating at the time. … Because you have little direct knowledge of what goes on in your mind, you will never know that you might have made a different judgment or reached a different decision under very slightly different circumstances. Formulas do not suffer from such problems.”)
Even in the setting of visual recognition tasks, it has been hypothesized recently that “bias amplification may depend on the difficulty of the classification task relative to the difficulty of recognizing group membership: bias amplification appears to occur primarily when it is easier to recognize group membership than class membership”—in other words, if the value of the protected attribute A is easier to infer compared to the value of the target variable, as indeed tends to be the case in visual recognition tasks with protected attributes like gender. In other settings, protected attributes might not be as readily derivable, which could also help to explain why bias was not only not amplified but was actually reduced in the cases cited earlier. Two final related points are worth making: First, because bias amplification and accuracy tend to be negatively correlated, striving for higher accuracy will naturally reduce bias amplification (see p. 2 of this paper). Second, when it happens, bias amplification can be mitigated by conventional techniques, such as regularization and procuring additional training data, without intrusive interventions either in the loss function or in the training data.
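To make the notion more concrete, here is a minimal sketch (in Python, with synthetic toy data and a simple difference-in-positive-rates metric, both of my own choosing rather than taken from any of the papers discussed above) of the kind of comparison the term implies: measure the association between the protected attribute and the labels in the training data (the degree x), then measure the association between the protected attribute and the model’s predictions (the degree y), and ask whether y exceeds x.

```python
# A minimal, illustrative sketch of what "bias amplification" is taken to mean here:
# compare the association between a protected attribute A and the positive label in
# the data (degree x) with the association between A and the model's predictions
# (degree y). All data below is synthetic toy data; the metric is a simple
# difference in positive rates between the two groups.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
A = rng.integers(0, 2, size=n)               # protected attribute (0/1)
X = rng.normal(size=(n, 3))                  # legitimate features
X[:, 0] += 0.8 * A                           # one feature is correlated with A
# Ground-truth labels depend on the features and (mildly) on A: the dataset is biased.
logits = 1.2 * X[:, 0] + 0.7 * X[:, 1] - 0.5 * X[:, 2] + 0.4 * A
y = (logits + rng.normal(scale=1.0, size=n) > 0).astype(int)

def positive_rate_gap(labels, group):
    """Difference in positive rates between group A=1 and group A=0."""
    return labels[group == 1].mean() - labels[group == 0].mean()

model = LogisticRegression().fit(np.column_stack([X, A]), y)
preds = model.predict(np.column_stack([X, A]))

x_bias = positive_rate_gap(y, A)      # "degree x": bias inherent in the data
y_bias = positive_rate_gap(preds, A)  # "degree y": bias exhibited by the model
print(f"dataset bias x = {x_bias:.3f}, model bias y = {y_bias:.3f}")
# If y > x, the model has amplified the dataset's bias (under this particular metric).
```

Different papers operationalize both quantities differently, which is part of the reason the notion remains somewhat slippery.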
Bias in Other Domains
Consider college and graduate school admissions as another example. There are indeed plenty of guidelines and regulations in place intended to ensure fairness, both at the level of federal and state law and at the institutional level. Do they succeed? The short answer is no. Most of the biases mentioned above can and do infect admission processes. And this does not take into account biases injected earlier into the process (e.g., by decisions such as where to recruit students), nor the largest biases of all—those involving money, biases that are exhibited either openly (as in admission policies involving the children of major donors, alumni and faculty members) or covertly.
How about decision making in health care? There, too, we are faced with a wave of cognitive and emotional/sociocultural biases that have significant adverse impact on patient care and for which there is no accountability or regulation. Cognitive pitfalls such as anchoring bias, sunk-cost bias, confirmation bias, base-rate bias, hindsight bias, and adaptive bias “influence how questions are framed, how risk factors and health status are measured, how decisions are made, and what actions are implemented.”22 According to a 2018 study of cognitive bias in clinical medicine, “cognitive error is pervasive in clinical practice. Up to 75% of errors in internal medicine practice are thought to be cognitive in origin, and errors in cognition have been identified in all steps of the diagnostic process, including information gathering, association triggering, context formulation, processing and verification.”23 In radiology alone, studies from as early as the 1940s found that
CXRs of patients with suspected tuberculosis were read differently by different observers in 10-20% of cases. In the 1970s, it was found that 71% of lung cancers detected on screening radiographs were visible in retrospect on previous films. The ‘average’ observer has been found to miss 30% of visible lesions on barium enemas. A 1999 study found that 19% of lung cancers presenting as a nodular lesion on chest x-rays were missed. Another study identified major disagreement between 2 observers in interpreting x-rays of patients in an emergency department in 5-9% of cases, with an estimated incidence of errors per observer of 3-6%. A 1997 study using experienced radiologists reporting a collection of normal and abnormal x-rays found an overall 23% error rate when no clinical information was supplied, falling to 20% when clinical details were available. A recent report suggests a significant major discrepancy rate (13%) between specialist neuroradiology second opinion and primary general radiology opinion.24
Disagreements between two radiologists on how to interpret the same image have been consistently found to range from 20% to 40% (see the 2021 paper Radiologists and Clinical Trials: Part 1 The Truth About Reader Disagreements).
In fact, radiologists disagree with themselves a good deal of the time: they may reach one conclusion at a given point in time and then, upon re-examining the same image later, reach a different conclusion contradicting the first one; intra-observer disagreement rates tend to be lower than inter-observer rates but are significant nevertheless, usually hovering around 20%. Indeed, as early as 1959, in a seminal article in AJR (American Journal of Roentgenology) entitled Studies on the accuracy of diagnostic procedures, L. Henry Garland, a physician who pioneered the research program of examining diagnostic errors, analyzed a 1944 study comparing the diagnostic value of various radiologic recording media for the detection of pulmonary tuberculosis and found intra- and inter-individual disagreement rates of 21% and 30%, respectively, results that have been “affirmed and reaffirmed by hundreds of researchers in the ensuing half century.”25 Occasionally disagreement rates exceed 50%, as in a group of Harvard University radiologists who disagreed on the interpretation of chest radiographs as much as 56% of the time.26
On the sociocultural front, research has shown, for example, that “a substantial number of … medical students and residents hold false beliefs about biological differences between blacks and whites”, beliefs that correlate “with racial disparities in pain assessment and treatment recommendations”. Medical providers are susceptible to ageism,27 view heavy patients “significantly more negatively” and spend less time with them, and believe that women exaggerate their pain symptoms and are more likely to treat their pain as a mental-health condition rather than a physical one, which often results in delayed diagnoses for women by comparison to men. Such biases are not harmless; they are strongly associated with substandard care. Researchers in 2017 carried out a systematic review of published findings on implicit bias in health care, examining 42 articles, and concluded that there is “a significant positive relationship between level of implicit bias and lower quality of care.” A consensus study report by the Institute of Medicine of the National Academies in 2003 noted that
a large body of published research reveals that racial and ethnic minorities experience a lower quality of health services, and are less likely to receive even routine medical procedures than are white Americans. Relative to whites, African Americans—and in some cases, Hispanics—are less likely to receive appropriate cardiac medication (e.g., Herholz et al., 1996) or to undergo coronary artery bypass surgery (e.g., Ayanian et al., 1993; Hannan et al., 1999; Johnson et al., 1993; Petersen et al., 2002), are less likely to receive peritoneal dialysis and kidney transplantation (e.g., Epstein et al., 2000; Barker-Cummings et al., 1995; Gaylin et al., 1993), and are likely to receive a lower quality of basic clinical services (Ayanian et al., 1999) such as intensive care (Williams et al., 1995), even when variations in such factors as insurance status, income, age, co-morbid conditions, and symptom expression are taken into account. Significantly, these differences are associated with greater mortality among African-American patients (Peterson et al., 1997; Bach et al., 1999)” [my italics].
In the financial sector, loan decisions made by humans are likewise swayed by conscious and unconscious biases. Loan officers are susceptible to peer bias, for example, where shared characteristics (such as gender) make a loan officer process financial information in a way that is more favorable to the applicant. Moreover, financial incentives, such as commission-based compensation, have been shown to cause loan officers to approve too many risky loans. A 2019 paper showed that “Black and Hispanic applicants’ car-loan approval rates are 1.5 percentage points lower than white applicants’, even when controlling for creditworthiness. In aggregate, this discrimination leads to over 80,000 minorities failing to secure loans each year” [my emphasis]. Interestingly, the same paper attributes this to the fact that auto loans involve personal interaction, unlike credit-card decisions, which are typically “made using statistical algorithms that provide less opportunity for direct discrimination.” Indeed, the authors found that “on average, the same minority applicant who faced lower approval rates on auto loans does not face lower approval rates on credit cards, during the same year. This finding suggests that the human element of auto lending, rather than actual differences in creditworthiness, leads to the lower approval rates for minorities” [author’s emphasis].28 Similar results have been obtained in the case of small-business loans. Women in Italy were shown to pay more for credit. A study of 14 loan officers in an Israeli bank showed that the officers regarded their “gut feelings” as “more valid indicators of the worthiness of the application than were the relevant financial data”. Recall the quote cited earlier about how “a well-dressed, well-groomed young individual has more chance than an unshaven, disheveled bloke of obtaining a loan from a human credit checker.”
Explainability and Accountability
A common rejoinder here is that, unlike humans, algorithms operate at scale and therefore their biases can have much more widespread harmful effects. In addition, algorithms are often opaque and lack proper mechanisms for appealing their decisions, whereas human bureaucracies have instituted norms that require explainability, enforce accountability, and allow for appeals. I will have a lot more to say specifically about explainability and accountability later, but let’s briefly consider these points here, starting with bias.
As we have seen, algorithms can reduce human bias in crucial decision-making domains. But proponents of horizontal AI regulation often claim that because technology can be used at scale, biased or otherwise harmful algorithms can somehow have much broader impact than human decision makers. (The related assertion that algorithms can amplify biases found in their training data is a more technical claim that was already discussed above.) Now, there is some irony in this claim: considering that the most colossal disasters in history, stretching from centuries of incessant wars, slavery, and genocides to more recent developments such as the Iraq debacle, have been the results of human decisions, often made by very small numbers of decision makers (the “leaders”) who rule the fates of billions of people, scaling issues don’t seem to be holding back our destructive decision making. But let’s put that aside. While it’s true, of course, that algorithms are fast, relatively cheap, and tireless, which is largely why they are introduced in the first place, blanket claims about their potential “scale” are often false. If a company implements an automated system to screen resumes, that will not suddenly cause millions of people to submit resumes to that particular company. The impact of the new system will be roughly the same as the impact of whatever manual—and thus very likely biased—screening system was in place before. Likewise, if a bank starts using an AI model to evaluate mortgage applications, that will not magically generate massive numbers of new loan applicants; assessing recidivism risk by algorithm will not increase the number of defendants whose recidivism risk needs assessment; and so on. The “blast radius” of a new decision-making system is often fixed, or at least bounded, by exogenous factors and constraints that are orthogonal to the internals of the system.
The other argument thread concedes that humans may be individually biased, but counters that human decision making is often institutionalized, and this bureaucratization has built safeguards to ensure that decisions that affect the public are not hijacked by any one individual’s whims or prejudices.29 In particular, human-driven decision-making in arenas such as law, health care, the labor market, and so on, is explicitly guided by collectively formulated rules and requirements for explainability and accountability.
Unfortunately, these claims about human institutions are Pollyannaish and not supported by the facts. Institutional or structural biases are just as pervasive as individual biases and even more entrenched, and indeed end up fueling and reinforcing individual biases. Witness, for example, the history of redlining and Jim Crow in the U.S.,30 and its subsequent transformation into what has been dubbed “predatory inclusion”. Moreover, human-driven decision making at the institutional level, especially when carried out under the authority or protection of governmental fiat, is usually not only opaque and impervious to explanation but also distinctly prone to post hoc rationalization and institutional corruption, while appeals are often impossible. Most life-altering decisions (whether to admit someone to college, extend them credit, send them to jail, give them access to a medical test or treatment, and so on) are ultimately left to the discretion of individuals who are under no obligation to explain their reasoning. These individuals may be functioning as parts of an institutional bureaucracy (a college, a bank, a court system, a hospital, etc.) and be bound by guidelines and regulations in theory, but in practice such constraints have little to no impact. In fact, institutional norms and practices often end up enabling obfuscation, all while paying lip service to transparency and accountability.
Consider judicial decisions, for example. Even though the U.S. legal system has a large number of appellate courts whose very purpose is to allow for appeals and to correct errors committed by trial courts, in practice appellate decisions usually end up rubberstamping whatever decisions were previously made by lower courts.31 Of course, there are multiple explanations for this phenomenon, one being that courts get things right the first time around with more than 90% accuracy, another being that developments such as the adoption of the “harmless error” doctrine32 have enabled appellate courts to cover up the errors of trial judges. The subject has been extensively discussed—and contested—in legal circles.33 Explainability has not fared better. Roughly until the 1960s, “reason giving” was indeed considered to be an essential duty of the judicial system, crucial for ensuring that the legal system remains open to the public, transparent, and accountable, and as a barrier against judicial corruption and abuse. Starting roughly in the early 1970s,34 it became increasingly common to replace written court decisions with so-called “unpublished” opinions that are “generally unpolished and less carefully crafted” and whose text is “sparse, containing only a minimal recitation of the facts and a limited description of the law,”35 a change that “threatens the values of fairness and openness that are central to our procedural system” and “particularly disadvantages members of vulnerable groups.”36 While these decisions are published in places like Lexis and West Law, they cannot be cited and are not precedential (they cannot set legal precedent). Unpublished opinions “give US State and Federal judges, and also, de facto, judicial clerks, staff attorneys employed by the courts, and (as in Colorado and New York) bureaucrats, power to declare judicial decisions of little or no precedential value and in some cases either to make them disappear from the public record or to abort them.”37 As a result, “judges may act arbitrarily or use the practice [of unpublished opinions] to avoid publicly issuing difficult, complicated or unpopular decisions,” whereas “providing a public, written opinion for each decision provides litigants and society the power to monitor the development and application of the law.”38
Take job applications as another example, an area where there are indeed plenty of institutional norms. Yet explainability and accountability for hiring decisions are virtually non-existent. Recruiters routinely reject large numbers of resumes without ever having to justify a single rejection. While HR staff might periodically review internal hiring practices, they tend to focus on aggregate statistics with a view to monitoring broad-based hiring metrics and trends (e.g., to see if hiring practices adhere to the four-fifths rule), not on individual decisions. For those candidates who are lucky enough to secure in-house interviews, interview feedback may be written up and recorded as a rationale for the final hiring decision, but usually this is perfunctory and consistent with biased hiring practices, whether the bias is explicit or implicit. Accountability only becomes an issue in those rare cases where a rejected applicant feels that they were discriminated against and decides to pursue legal action (which, of course, would be an option even if the decision was made by an algorithm). Individual screening and/or hiring rejections are practically never explained, justified, or defended, particularly in the private sector, for a number of reasons.39
The situation is similar in college and graduate school admissions. In most cases, the decision of whether to admit a college applicant is made “holistically,”40 based on a mixture of quantitative and qualitative considerations, and is ultimately up to humans (sometimes just one) who have considerable leeway in determining the outcome, without any need for explanation or accountability. As Jonathan Cole, a professor and former provost at Columbia University, put it:
Admission often depends on which person in the admissions committee reads your application; what their biases are, their presuppositions; whether they’ve had a bad egg-salad sandwich that day or read too many applications. These are all things that enter our decision-making process as human beings.
“It is [a lottery],” he said, “but no one is willing to admit it.” Nor are there any mechanisms for appealing the results of the lottery.
In health care, holding medical providers accountable for their decisions falls to state licensing boards and the legal justice system and is restricted to egregious cases of malpractice or misconduct.41 Nor are there any available processes for obtaining transparent explanations of medical decisions,42 apart from a patient simply asking for an explanation, which might or might not be forthcoming. In some cases an ostensible explanation might be more of a rationalization. For instance, a primary-care doctor’s decision to refer a patient to a specialist could be motivated by hospital policies designed to keep lucrative referrals in-house. Should the patient ask for the rationale behind the referral, they are likely to receive a different set of reasons. Likewise, a physician practicing defensive medicine who takes a certain action or refrains from it for fear of getting sued is very likely to rationalize her action differently.
So while there may be formal mechanisms in place, such as Internal Affairs bureaus inside police departments, that are supposed to hold individuals accountable for their decisions, these institutional checks rarely function as intended.43
Institutional biases are more pernicious because they are harder to challenge, as they are often cloaked in the legitimacy of established bureaucracies and legal doctrines. In fact, they often end up being protected by the very legal system that is supposed to be the people’s bulwark against injustice. Consider, for instance, the way in which the Supreme Court has repeatedly undermined Section 1983 since the late 1960s. Section 1983 was first enacted in 1871 as part of the Ku Klux Klan Act, largely with the intention of stopping the harassment of newly freed slaves by the government. Historically, Section 1983 was instrumental in holding public officials (such as police officers or tax collectors) accountable for violating the constitutional and statutory rights of citizens, particularly under the Fourteenth Amendment. Starting in 1967,44 the Supreme Court45 introduced a legal doctrine known as qualified immunity, which protects officials from liability even when they act unlawfully, as long as their actions don’t violate “clearly established” law. In practice, this means that a citizen cannot hold government officials accountable unless there is an existing legal precedent with essentially identical facts. The original 1967 decision had a clear and fairly defensible rationale: It was intended to deter excessive or baseless litigation against government officials, which could discourage them from discharging their duties, and to protect them from the financial burdens of such litigation. It also required the officials to demonstrate that they were acting in good faith. But over the years the doctrine has hardened into a blunt instrument that those in power can wield to their advantage in abusive ways.46 The “good faith” requirement, for instance, was discarded in 1982. Qualified immunity has been roundly criticized by an unusually wide and diverse spectrum of voices, including law professors,47 civil rights organizations, commentators on the left and the libertarian right, conservative interest groups like the Alliance Defending Freedom, and even Supreme Court justices on opposite ends of the ideological spectrum.48 The critics rightly contend that the doctrine undermines accountability and creates a significant barrier to justice for victims of governmental abuses, effectively placing a protective shield around public officials that can seem impenetrable. Nevertheless, qualified immunity persists. Its history illustrates the official protection afforded to institutional biases, how daunting it is to eradicate them, and the need for genuine structural societal reforms that can only be brought about through determined political action, not through research on algorithmic fairness.
Judgments Under Uncertainty: Humans vs. Algorithms
As early as 1958, scholars were observing that
studies have repeatedly shown that impressionistic predictions of human behavior, even by highly qualified staff, are erroneous more frequently than are predictions based on statistical analysis. That is why statistical prediction methods are preferred, whenever practicable, in insurance and in other businesses which evaluate errors of behavior prediction in dollars and cents.
Four years earlier, in 1954, the famous psychologist Paul Meehl had published what was to become a classic—if controversial—book: Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence. Meehl’s seminal work was rooted in psychology, and the term “clinical” harkens back to its use in medical and psychological evaluations. But by the 1950s the scope of the phrase “clinical judgment” had already expanded to encompass a style of decision making that relies on expert judgment, holistic assessment, and individualized consideration, regardless of the problem domain. Today, “clinician” is virtually interchangeable with “expert” and “clinical judgment” with “human expert judgment.” Clinical judgments are contrasted with algorithmic or mechanical judgments, which are often called statistical judgments in the literature.49 Why “statistical”? Because the typical setting is a prediction problem where a random variable capturing some aspect of human behavior is modeled as a function of other behavioral traits that have already been observed (the “input features,” in machine-learning terminology, or “independent variables” in more classical statistical parlance), and the function is parameterized over a number of weights whose values are induced from historical data. Multiple regression models are the quintessential examples of this approach, of course, but in this context “statistical” is a term of art that is effectively synonymous with “algorithmic.” Thus, an algorithmic system is also known as a statistical system. A decision-making process that relies exclusively on humans is sometimes called a judgmental system.
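To make the terminology concrete, here is a small illustrative sketch (in Python, with invented numbers; the features and the prediction target are hypothetical and not drawn from any study cited here) of what such a “statistical” predictor amounts to: a formula whose weights are estimated, once, from historical records and then applied mechanically to new cases.

```python
# Toy illustration of a "statistical" (mechanical/actuarial) predictor: a weighted
# formula whose weights are estimated from historical data. The features and numbers
# are invented for illustration only.
import numpy as np

# Historical records: [high-school GPA, aptitude-test z-score] -> first-year GPA
X_hist = np.array([[3.6, 1.2], [2.9, -0.3], [3.2, 0.5], [3.9, 1.8], [2.5, -1.0], [3.4, 0.2]])
y_hist = np.array([3.5, 2.7, 3.0, 3.8, 2.4, 3.1])

# Least-squares fit of y ≈ w0 + w1*GPA + w2*test (the weights "induced from historical data")
design = np.column_stack([np.ones(len(X_hist)), X_hist])
weights, *_ = np.linalg.lstsq(design, y_hist, rcond=None)

# The resulting formula is then applied mechanically to a new applicant.
new_applicant = np.array([1.0, 3.3, 0.4])
print("predicted first-year GPA:", new_applicant @ weights)
```

The essential point is only that the combination rule is fixed in advance and derived from data, rather than formed impressionistically case by case.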
Meehl’s “disturbing little book,” as he called it, led to a prolonged debate in psychology and the social sciences in general, giving rise to hundreds of publications and inviting heated polemical exchanges. But by now the debate is generally considered settled firmly in favor of Meehl. A 2000 comprehensive meta-analysis of more than 100 studies comparing human and mechanical predictions of everything from college academic performance and surgical outcomes to criminal recidivism concluded that data-driven statistical analyses “substantially outperformed” subjective human judgments. In a 2007 paper, Grove and Lloyd wrote:
there may well be reasoning processes that clinicians sometimes use that a formula, table, or computer program cannot precisely mimic. However, whether such reasoning actually helps clinicians dependably outperform statistical formulas and computer programs is an empirical question with a clear, convincing answer: No, for prediction domains thus far studied. The burden of proof is now squarely on clinicians’ shoulders to show, for new or existing prediction problems, that they can surpass simple statistical methods in accurately predicting human behavior. (p. 194)
Chapter 21 of Kahneman’s Thinking, Fast and Slow is aptly titled “Intuitions vs. Formulas” and contains a masterful discussion of Meehl’s work and its legacy. Kahneman writes:
Meehl reviewed the results of 20 studies that had analyzed whether clinical predictions based on the subjective impressions of trained professionals were more accurate than statistical predictions made by combining a few scores or ratings according to a rule. In a typical study, trained counselors predicted the grades of freshmen at the end of the school year. The counselors interviewed each student for forty-five minutes. They also had access to high school grades, several aptitude tests, and a four-page personal statement. The statistical algorithm used only a fraction of this information: high school grades and one aptitude test. Nevertheless, the formula was more accurate than 11 of the 14 counselors. Meehl reported generally similar results across a variety of other forecast outcomes, including violations of parole, success in pilot training, and criminal recidivism.
Not surprisingly, Meehl’s book provoked shock and disbelief among clinical psychologists, and the controversy it started has engendered a stream of research that is still flowing today, more than fifty years after its publication. The number of studies reporting comparisons of clinical and statistical predictions has increased to roughly two hundred, but the score in the contest between algorithms and humans has not changed. About 60% of the studies have shown significantly better accuracy for the algorithms. The other comparisons scored a draw in accuracy, but a tie is tantamount to a win for the statistical rules, which are normally much less expensive to use than expert judgment. No exception has been convincingly documented.
The range of predicted outcomes has expanded to cover medical variables such as the longevity of cancer patients, the length of hospital stays, the diagnosis of cardiac disease, and the susceptibility of babies to sudden infant death syndrome; economic measures such as the prospects of success for new businesses, the evaluation of credit risks by banks, and the future career satisfaction of workers; questions of interest to government agencies, including assessments of the suitability of foster parents, the odds of recidivism among juvenile offenders, and the likelihood of other forms of violent behavior; and miscellaneous outcomes such as the evaluation of scientific presentations, the winners of football games, and the future prices of Bordeaux wine. Each of these domains entails a significant degree of uncertainty and unpredictability. We describe them as “low-validity environments.” In every case, the accuracy of experts was matched or exceeded by a simple algorithm.
Kahneman goes on to claim that oftentimes it is not even necessary to use statistical techniques like regression:
One can do just as well by selecting a set of scores that have some validity for predicting the outcome and adjusting the values to make them comparable (by using standard scores or ranks). A formula that combines these predictors with equal weights is likely to be just as accurate in predicting new cases as the multiple-regression formula that was optimal in the original sample. More recent research went further: formulas that assign equal weights to all the predictors are often superior, because they are not affected by accidents of sampling. ⋯ The surprising success of equal-weighting schemes has an important practical implication: it is possible to develop useful algorithms without any prior statistical research. Simple equally weighted formulas based on existing statistics or on common sense are often very good predictors of significant outcomes. ⋯ The important conclusion from this research is that an algorithm that is constructed on the back of an envelope is often good enough to compete with an optimally weighted formula, and certainly good enough to outdo expert judgment. This logic can be applied in many domains, ranging from the selection of stocks by portfolio managers to the choices of medical treatments by doctors or patients. (p. 226)
Here Kahneman overstates the case. First, “selecting a set of scores that have some validity for predicting the outcome” is not as straightforward as it might appear. It corresponds roughly to what is known as feature selection in machine-learning parlance, and can present significant challenges, especially if there isn’t enough data. Second, back-of-envelope algorithms using equal weighting are rarely as accurate as statistically derived models. Most of the algorithms used in practice, including VRAG and the credit-scoring algorithms that will be discussed later, are heavily based on statistics and it is highly unlikely that naive approaches would do as well. Moreover, oftentimes the stakes are so high that even small gains in accuracy can have significant financial or societal impact, so model optimization on the basis of historical data becomes practically inevitable. But Kahneman is correct to point out that even naive algorithms tend to outperform human judgment.
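For what it’s worth, the equal-weighting scheme Kahneman describes is easy to spell out. The sketch below (with invented candidate scores and hypothetical predictors) standardizes each predictor and simply averages the standardized values, which is the sort of back-of-the-envelope, “improper” formula at issue; whether such a formula would suffice in a given high-stakes application is exactly the question raised above.

```python
# Sketch of the equal-weighting ("back of the envelope") idea: standardize each
# predictor to z-scores and combine them with equal weights, no regression required.
# The candidate scores below are invented for illustration.
import numpy as np

# Rows: candidates; columns: three predictors believed to have some validity
# (e.g., a structured-interview score, a work-sample score, a test score).
scores = np.array([
    [72, 3.1, 55],
    [65, 3.8, 61],
    [80, 2.9, 47],
    [70, 3.5, 66],
])

# Standardize each column so the predictors are on comparable scales...
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
# ...then combine with equal weights: the composite is just the row-wise mean.
composite = z.mean(axis=1)

# Rank candidates by the equally weighted composite (best first).
print("ranking (0-indexed candidates):", np.argsort(-composite))
```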
Given their impressive track record, should algorithms then replace human decision makers in “low-validity” environments? It depends. In some domains it might be impossible for an algorithm to capture the countless possible signals that can emerge during the decision-making process and which might be predictive of future behavior. An obvious example is predictive information conveyed by rare events, which are not factored into formulas because they cannot be anticipated. These are known in the literature as “broken-leg scenarios,” after a favorite example of Meehl’s about a formula that is highly successful in predicting whether someone will go to the movies, but becomes void if we discover that the subject is immobilized with a fractured femur, which then becomes the sole determinative factor. But there may be less contrived cases where the human ability to assimilate a constant influx of new information and environmental cues can be an advantage.50 For instance, in trying to assess a criminal offender’s risk of violent behavior, their demeanor during an interview with a human case worker might provide valuable insights, as was argued by Gottfredson and Moriarty in 2006. In such cases, a marriage of human judgment with the precision of a formula might perform better than the formula alone, especially when the combination is itself calibrated via a formula with empirically validated weights.
But note that access to additional information is not, in general, helpful to human decision makers. It might actually set them back, because it tends to spread their attention thin and can also invite them to consider unnecessarily “complex combinations of features,” as Kahneman points out (p. 224). And empirical evidence in support of combining clinical with statistical judgments is thin (Gottfredson and Moriarty do not present any such evidence of their own). By contrast, with the passage of time more and more research has accumulated in favor of statistical decision making. Consider, for instance, the vital problem of predicting the risk that a criminal offender will engage in violence if released. Specifically in the context of domestic violence, Berk et al. reported in 2016 that a machine learning model cut the false negative error rate of human magistrates nearly in half (from roughly 20% to 10%):
Under current practice within the jurisdiction studied, approximately 20 percent of those released after an arraignment for domestic violence are arrested within two years for a new domestic violence offense. If magistrates used the methods we have developed and released only offenders forecasted not to be arrested for domestic violence within two years after an arraignment, as few as 10 percent might be arrested. The failure rate could be cut nearly in half. Over a typical 24-month period in the jurisdiction studied, well over 2,000 post-arraignment arrests for domestic violence perhaps could be averted.
One of the most influential actuarial methods for forecasting violent offender behavior in general (not just in the context of domestic violence) is a technique named Violence Risk Appraisal Guide, or VRAG for short, developed in the early 1990s. The Encyclopedia of Psychology and Law describes VRAG as follows:
The violence risk appraisal guide (VRAG) is an actuarial instrument that assesses the risk of further violence among men or women who have already committed criminal violence. On average, it has yielded a large effect in the prediction of violent recidivism in more than three dozen separate replications, including several different countries, a wide range of follow-up times, several operational definitions of violence, and many offender populations. It is the most empirically supported actuarial method for the assessment of violence risk in forensic populations. (p. 847)
The inventors of VRAG wrote an influential book, Violent Offenders: Appraising and Managing Risk, which has been published in three editions so far. It points to an overwhelming amount of empirical evidence in support of the predictive accuracy of their method and its superiority over clinical judgments.51 In earlier editions of their book, they had also endorsed combining their method with clinical judgments, not because they thought that this would increase the method’s effectiveness, but because they believed it would make their method more acceptable to clinicians (see below for a discussion of what has been termed algorithm aversion). By the time the third edition of their book was printed in 2015, they had this to say on the subject:
Having described the development and subsequent testing of actuarial systems, readers might be left wondering why all this effort was necessary in the first place. Couldn’t clinicians use their knowledge of the evidence and experience to render expert judgments about the risk of violence? This would allow them to include a host of factors that seem relevant to the particular case at hand and not to be confined to a short and unvarying list of factors. Permitting clinical intuition to play a role in formulating risk assessment would certainly make the incorporation of actuarial tools much more palatable to many practitioners. Indeed, two decades ago, we did give such permission (Webster, Harris, Rice, Cormier, & Quinsey, 1994) with that very purpose in mind—making actuarial methods more palatable. In this chapter and the one that follows, we review evidence and logic that now persuade us otherwise—the wishes of practitioners notwithstanding, clinical judgment in the manner just alluded to is unjustified, both empirically and ethically. Three quarters of a century’s research has severely shaken confidence in clinical judgment, in absolute terms and in comparison with actuarial methods. (pp. 171-172, my emphasis)
This passage comes from the beginning of chapter 6, entitled Clinical Judgment. The rest of that chapter presents a devastating amount of empirical evidence militating against human judgment under uncertainty in general, much of it deriving from the heuristics-and-biases literature that Kahneman and Tversky pioneered, and against clinical judgment specifically in the context of forensic work, such as this 2010 study by Desmarais et al., which “reported that forensic clinicians were generally extremely confident of their judgments about violence risk, but actual accuracy was inversely related to confidence.”
Despite all the evidence, widespread use of algorithms making autonomous decisions in socially sensitive domains such as the law or health care is unlikely in the near future, largely because we are biased against them. Writing in the Journal of Experimental Psychology, Dietvorst et al. noted:
Research shows that evidence-based algorithms more accurately predict the future than do human forecasters. Yet when forecasters are deciding whether to use a human forecaster or a statistical algorithm, they often choose the human forecaster. This phenomenon, which we call algorithm aversion, is costly, and it is important to understand its causes.
The authors conducted experiments showing that people “more quickly lose confidence in algorithmic than human forecasters after seeing them make the same mistake.” I will have more to say about algorithm aversion later.
But in some cases, particularly when real-time environmental signals are less important, algorithms have already taken the helm. One such case is credit underwriting, where relatively simple scoring formulas revolutionized the credit industry and have almost completely displaced human analysts.52 The next article in this series will discuss credit scoring in detail.
The MCAS software wasn’t exactly buggy, but it was certainly incorrectly designed and specified, insofar as it violated basic redundancy tenets in allowing erroneous input from one single sensor to initiate a nose dive. As this author put it: “It is astounding that no one who wrote the MCAS software for the 737 Max seems even to have raised the possibility of using multiple inputs, including the opposite angle-of-attack sensor, in the computer’s determination of an impending stall. As a lifetime member of the software development fraternity, I don’t know what toxic combination of inexperience, hubris, or lack of cultural understanding led to this mistake.”
The New York Times would beg to differ, of course, as they are claiming substantial amounts in damages from OpenAI, but copyright infringement claims “often seem far-fetched—where is the damage to the rights holder, exactly?”, as Louis Menand points out in an insightful recent piece in the New Yorker on the relationship between AI and IP law.
Both of these types of discrimination will be discussed extensively in a subsequent article.
The DFS report goes on to point out: “Although these couples complained that they were treated differently by the Bank, they typically had different credit scores, and a closer look revealed differences in credit profiles between spouses. In some instances, for example, one spouse was named on a residential mortgage, while the other spouse was not. Likewise, some individuals carried multiple credit cards and a line of credit, while the other spouse held only a single credit card in his or her name. These distinguishing characteristics can lead to differing credit offers. This is because creditors consider an applicant’s experience with and history of managing credit to be a predictor of future credit management, so a credit history with multiple tradelines and mix of credit types, such as a mortgage, student loans, and credit cards with on-time payments may produce more favorable credit terms than a sparser credit history with just one or two credit cards, even if those cards are always paid off fully and timely.”
Sometimes there is no alternative, because only algorithms can do the job at the required scale. These cases represent novel applications of technology that do not involve simply replacing humans by machines; the use cases themselves are new. Their regulation can be more challenging, precisely because there is no natural baseline that can be used to assess the technology’s performance.
Notable exceptions are application processes driven by GUI-based questionnaires asking a series of questions intended to extract the values of specific structured-data fields, which have predetermined sets of possible values (e.g., a Boolean field will only allow yes and no as possible answers, an educational attainment field might have high_school, bachelors, masters, and phd as possible values, and so on). Such a process is likely to incorporate a number of essential “knock out” questions intended to enforce mandatory requirements, such as the applicant’s legal working status in a country or specific educational qualifications. That can indeed result in automated rejection if the candidate fails to satisfy a non-negotiable criterion (a surprisingly common occurrence, as usually the majority of submitted resumes fail basic job requirements), but that has nothing to do with AI and hardly qualifies as automated “resume screening.” It is rote collection of structured data followed by the enforcement of minimal job requirements on the collected data. Even when an attempt is made to auto-populate the answers by parsing an uploaded resume, the extracted field values are typically displayed in a GUI and the candidate has the opportunity to rectify any ingestion errors.
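To illustrate (purely hypothetically; the field names and the specific requirements below are mine, not drawn from any real applicant-tracking system), the knock-out logic described here amounts to nothing more than checking structured answers against fixed, non-negotiable criteria:

```python
# Minimal sketch of the "knock-out" logic described above: structured answers with
# predetermined values are checked against mandatory requirements. There is no model
# here and no resume parsing; the field names and rules are hypothetical.
from dataclasses import dataclass

EDUCATION_LEVELS = ["high_school", "bachelors", "masters", "phd"]

@dataclass
class Application:
    legally_authorized_to_work: bool   # Boolean field: only yes/no possible
    education: str                     # one of EDUCATION_LEVELS

def passes_knockout(app: Application) -> bool:
    """Return False if any mandatory requirement is not met."""
    if not app.legally_authorized_to_work:
        return False
    # Require at least a bachelor's degree (a hypothetical mandatory criterion).
    if EDUCATION_LEVELS.index(app.education) < EDUCATION_LEVELS.index("bachelors"):
        return False
    return True

print(passes_knockout(Application(True, "masters")))       # True
print(passes_knockout(Application(True, "high_school")))   # False
```

There is no learning and no ranking involved; a candidate either meets the mandatory criteria or does not.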
It is worth watching this short video clip on the subject, made by an experienced recruiter who has worked for Amazon, Google, and Microsoft.
I am not suggesting that software failures have never elicited such responses; some high-profile cases have. What I am suggesting is that the intensity of those responses is dwarfed by the scale and intensity of the pushback against AI.
It is widely thought that individual fairness is a more formal and conservative notion while group fairness is more substantive and egalitarian. I will be pointing out, however, that individual fairness aligns better with the normative individualism that underpins virtually all shades of liberalism.
It only takes a few examples to see why the skepticism is warranted. One of SAP’s “Guiding Principles for AI,” for instance, is that they “design for people.” In particular, “by providing human-centered user experiences through augmented and intuitive technologies, [they] leverage AI to support people in maximizing their potential.” This is marketing pablum, not a “Guiding Principle for AI.” OECD’s first AI principle is “Inclusive growth, sustainable development and well-being,” which dictates that AI must benefit people and the planet, e.g., by “augmenting human capabilities and enhancing creativity, advancing inclusion of underrepresented populations, reducing economic, social, gender and other inequalities, and protecting natural environments, thus invigorating inclusive growth, sustainable development and well-being.” Accenture UK’s “Ethical Framework for Responsible AI and Robotics” poses the question “So what is Responsible AI?” and proceeds to give this answer: “Responsible AI is the practice of designing, developing, and deploying AI with good intention to empower employees and businesses, and fairly impact customers and society,” apparently making responsible AI a matter of the practitioners’ mental states (their good intentions) rather than of any relationship between their practice and its actual effects on customers and society. Similar examples abound.
A negative right is one that restrains other parties from behaving in certain ways towards the holder of the right. The right to be free from “unreasonable seizures and searches,” enshrined in the Fourth Amendment of the U.S. Constitution, is a classic example of a negative right that prevents an entity (in this case the government) from carrying out certain actions towards the bearers of the right (the citizens). By contrast, a positive right is essentially an entitlement. It gives the holder of the right a certain claim against another party, which is required to act towards the holder in a certain way, for example, to provide a certain good or service to them. Positive rights inherently entail obligations. Just as no one can take unless someone gives, no one can have a positive right unless someone else has a corresponding obligation. In the U.S., the Sixth Amendment gives every criminal defendant the right to be represented by an attorney at their trial, even if they cannot afford one. That’s a positive right that requires the government to provide the defendant access to a certain resource, namely, legal counsel.
First-generation human rights are generally understood to be those that emerged in the eighteenth century in connection with the French and American revolutions. Arguably, there is an even earlier generation of rights, one that was framed in the language of “natural rights” rather than human rights and was deeply rooted in theological thought.
“Everyone has the right to rest and leisure, including reasonable limitation of working hours and periodic holidays with pay.”
Technically speaking, a right is not the same as a human right. As Bart van der Sloot points out in his article on Legal Fundamentalism, a right is understood as a legal right; it is a notion that regulates the interactions of citizens with one another and with entities such as businesses. In the literature, these are distinguished both from constitutional rights, which regulate interactions of citizens with the state, and from human rights, which are universal, independent of any state, and inalienable (irrevocable). Nevertheless, in the Huq article cited in the main text, the “right to a human decision” seems to be conceived as a new fundamental right, a notion that appears to be a conceptual innovation of the EU that no one quite understands (meaning that no one knows how—or if—fundamental rights differ from human rights, or whether they are quasi-constitutional rights, or something else altogether; see section 1.4 of the van der Sloot article, “What is a Fundamental Right?”, for an informative discussion). See also this article, which views the “right to a human decision” as a “digital right” that is being transformed into a human right.
How about entertainment, for example? After all, being entertained is arguably an essential ingredient of a good life. If so, is access to entertainment not a positive human right, and is it not the government’s responsibility to provide it?
National Bureau of Economic Research.
See also the paper Religious affiliation and hiring discrimination in New England: A field experiment and this BBC article: Does religious bias begin with your CV?.
Refer, for instance, to this 2018 article, which used evidence from U.S. cities to show that “bail judges are racially biased against black defendants, with substantially more racial bias among both inexperienced and part-time judges.”
The literature has found scant support for the effectiveness of the somewhat related but broader diversity training programs in the corporate world. Refer to this comprehensive and in-depth 2021 review of relevant research, and see also Why Doesn’t Diversity Training Work? and this 2018 article.
A 2019 study found significant racial and gender disparities between the sentencing decisions of Republican-appointed judges and those of their Democrat-appointed peers, and that “these differences cannot be explained by other judge characteristics and grow substantially larger when judges are granted more discretion.”
The proverbial notion that “Justice is what the judge had for breakfast” has been folk wisdom for some time, but it wasn’t empirically examined until 2011, when a study by Danziger et al. posited the so-called “hungry-judge effect.” Danziger and his colleagues analyzed decisions made by Israeli parole boards and found that the probability of a favorable parole decision dropped from 65% at the beginning of a session in the morning to 0% just before a scheduled lunch break. Favorability rates abruptly returned to 65% after the break, only to gradually decline again. Because the magnitude of the reported effect was unusually large, critics suggested that the discrepancy might be explained, at least in part, by other factors (such as a non-random scheduling of the cases). See, for instance, this early critique, along with the reply to it by Danziger and his co-authors, defending the study’s conclusions, and a 2016 study by Andreas Glöckner that used simulation to argue that the magnitude of the hungry-judge effect had been overestimated.
While the results of the Danziger study might have been exceptionally pronounced, there is a sizable body of research documenting similar—if smaller—effects of visceral factors on human decision making, including the effect of sleep deprivation on judicial decision making (judges impose more severe sentences when they have not had enough sleep), and the impact of emotions “that are unrelated to the merits of the case” on judicial decisions—as determined, for instance, by whether or not the local football team won the day before the decision was made; see pp. 926-927 of the paper by Dan Priel for additional references. A more recent 2020 paper that reviewed that debate acknowledged that “prima facie, its [the 2011 paper by Danziger et al.] general thrust, namely the suggestion that judicial decision making might be affected by the time of the last meal, and thereby factors as glucose, hunger, mood, or mental fatigue, seems plausible. It ties in with other studies demonstrating physiological, psychological, or situational influences on moral decision making of lay persons and legal experts.” (And even the Glöckner paper acknowledged that “there is clear evidence that judicial decision making is influenced to some degree by extraneous factors, which is also reflected in prevailing theories in law and psychology.”)
For a list of many more cognitive biases affecting medical professionals, see The Importance of Cognitive Errors in Diagnosis and Strategies to Minimize Them, by P. Croskerry.
See Cognitive Bias in Clinical Medicine by E. O’Sullivan and S. J. Schofield.
Discrepancy and Error in Radiology: Concepts, Causes and Consequences, A. Brady, E. Ó. Laoide, P. McCarthy, and R. McDermott. For additional context, see the 2017 paper Error and discrepancy in radiology: inevitable or avoidable?, by A. P. Brady.
As Garland relates in the 1959 paper, when he told one of his friends, a well-known professor of radiology, that radiologists missed about one third of roentgenologically positive findings, the friend expressed the hope that Garland would discontinue his “investigations in this field because they were so morale-disturbing.” And when other radiologists were confronted with this data, their usual reaction was “Well, in my everyday work this does not apply; I would do better than those busy investigators”—a classic manifestation of over-confidence bias, the “most prevalent and more potentially catastrophic problem in judgment and decision making” (The psychology of judgment and decision making, S. Plous, McGraw-Hill, 1993, p. 217).
Disagreements in Chest Roentgen Interpretation, P. G. Herman et al.
See also the 2015 paper Discrimination in Healthcare Settings is Associated with Disability in Older Adults: Health and Retirement Study; the 2020 paper Ageism Amplifies Cost and Prevalence of Health Conditions; and the 2022 paper Healthcare Professionals’ Views and Perspectives towards Aging.
See also the 2023 paper Evidence of Racial Discrimination in the $1.4 Trillion Auto Loan Market.
See, for instance, the 2019 article Rulemaking and Inscrutable Automated Decision Tools, and also the book by Barocas et al., Fairness and Machine Learning, pp. 24–26.
Rothstein makes a powerful case that “today’s residential segregation in the North, South, Midwest, and West is not the unintended consequence of individual choices” but rather that of “public policy” that was “systematic and forceful” (from the book’s preface, my emphasis). It was a coordinated policy that was institutionalized by lenders and supported by government agencies. What were the supposed explainability and accountability “norms” of those bureaucracies, and what good were they to the policy’s victims?
With the sole exception of the Supreme Court, which routinely overturns previous decisions, appellate courts affirm lower-court criminal decisions over 90% of the time, a phenomenon that has been dubbed the affirmation bias of the appellate court system. See the paper The Futility of Appeal: Disciplinary Insights into the "Affirmance Effect" on the United States Courts of Appeals and also the 2019 article Why Appeals Courts Rarely Reverse Lower Courts: An Experimental Study to Explore Affirmation Bias.
See also this article.
For example, see this 2018 paper.
Unpublished decisions go back much further (to the 19th century) but did not start to become common until the 1970s.
See the 2009 article Unpublished opinions: A convenient means to an unconstitutional end by Erica S. Weisgerber.
From the 2005 article Commentary: Unpublication and the Judicial Concept of Audience, by J. M. Shaughnessy (p. 1598).
From the 2005 article Inequitable Injunctions: The Scandal of Private Judging in the U.S. Courts, by P. Pether (p. 1438).
From the article The Unpublished, Non-Precedential Decision: An Uncomfortable Legality?, by M. H. Weresh (pp. 181-182).
One such reason is fear of litigation. Attorneys routinely advise HR organizations against providing any form of feedback to rejected applicants, because anything that might be said could be misinterpreted and potentially used to launch a complaint against the company. Even if no legal action is taken, if the employer provides specific and accurate (honest) reasons for the rejection, persistent applicants are apt to challenge those reasons and attempt to change the employer's mind; the ensuing debate is not likely to be productive. A related reason is limited time and resources. Many firms receive thousands of job applications, and it would be infeasible to provide personalized feedback to all applicants. Finally, even if time were not an issue, it is simply not possible for most organizations to coach a candidate on how to improve their chances of getting hired (that is not their job anyway). Hiring managers and HR staff are aware of their own internal hiring needs but can hardly speak for what other companies might be looking for. Given that "explanations" of individual screening and/or hiring rejections are virtually never provided by human recruiters, it is hard to see why that demand should be placed on algorithmic methods.
It is interesting to note that until the 1920s, admission decisions at elite US colleges were made on purely quantitative or "objective" grounds, such as grades. The move to a more "holistic" approach started in the 1920s because by that point the exclusive focus on scholastic criteria had resulted in the admission of what was deemed to be an excessive number of Jewish applicants, almost all of them immigrants from Eastern Europe. See Jerome Karabel's The Chosen: The Hidden History of Admission and Exclusion at Harvard, Yale, and Princeton. Following the Supreme Court's 2003 decision in Grutter v. Bollinger, the term "holistic" in the context of college admissions became effectively synonymous with the use of racial considerations in the admissions process. In Grutter, the Supreme Court essentially sanctioned the consideration of race in college admissions, as long as "each applicant is evaluated as an individual and not in a way that makes an applicant's race or ethnicity the defining feature of his or her application," finding that the University of Michigan Law School engages in "a highly individualized, holistic review of each applicant's file, giving serious consideration to all the ways an applicant might contribute to a diverse educational environment." As is well known, in June 2023 Grutter was effectively overturned (or rather retired, since the 2003 decision already contained a "sunset provision" anticipating that "25 years from now the use of racial preferences will no longer be necessary to further the interest approved today"), on the grounds that race-based admissions programs violate the Equal Protection Clause.
Even in egregious cases, it is rare for either individual physicians or hospitals to assume responsibility, as the customary practice is to "deny and defend". Pro forma institutional mechanisms such as peer reviews have a mostly cosmetic function, because a pervasive "club culture" encourages insiders to close ranks and protect one another by erecting a "white wall of silence" around any allegation of malpractice, akin to the "blue wall of silence" found in police departments. These are sadly common forms of institutional corruption, which "are not strictly illegal yet pervert an institution's function under conditions that may promote personal benefit".
In the US, the passage of HIPAA (the Health Insurance Portability and Accountability Act) in 1996 gave patients the right to access their medical records, including clinical notes written by physicians, at least in principle (ready electronic access to such notes was not effectively mandated at the federal level until the information-blocking provisions of the 21st Century Cures Act took effect in 2021), and it could be argued that such notes contain a physician's reasoning about a patient's case. But clinical notes are written in a highly idiomatic style teeming with abbreviations and esoteric medical terminology, and would not be comprehensible to most laypeople (e.g., in a clinical note "SOB" stands for "shortness of breath" but will likely elicit a different interpretation from a patient). They are primarily intended to facilitate efficient communication among healthcare professionals and to serve as mnemonic aids for the physicians who write them, not as explanatory texts to be consumed by patients. Indeed, the movement towards "open notes" in the US and in some European countries has some physicians worried that they will have to start oversimplifying clinical notes, thereby undermining the functions for which the notes were originally intended. As a result, some healthcare professionals, most notably in Norway, have reportedly started keeping "shadow notes" that are inaccessible to patients. Moreover, in practice, "clinical notes produced with EHRs frequently contain redundant information and errors, and may never be read despite containing relevant information for patient care", as rampant use of cut-and-paste results in bloated and poorly organized notes. Incidentally, writing clinical notes is a key reason cited for physician burnout (for every hour spent with a patient, roughly two hours are spent writing notes). For that reason, ironically, the writing of clinical notes is now increasingly being delegated to generative AI.
A Human Rights Watch report on internal affairs divisions in U.S. police departments stated that
it is alarming that no outside review, including our own, has found the operations of internal affairs divisions satisfactory. In each city we examined, internal affairs units too often conducted substandard investigations, sustained few allegations of excessive force, and failed to identify and punish officers against whom repeated complaints had been filed. Rather, they, in practice, often shielded officers who committed human rights violations from exposure and guaranteed them immunity from disciplinary sanctions or criminal prosecution. (my italics)
See the Wikipedia article for the evolution of the doctrine from the 1960s until today.
Supreme Court decisions themselves are thought to be swayed by “a staggering variety of extralegal social factors,” including, according to professors Neal Devins and Lawrence Baum,
the informing role of the "elite social networks that the Justices are a part of." Relying on a "psychological model" of judicial behavior, they argue that judges are acutely sensitive to the views and values of other elites, especially to the relatively cohesive networks of opposed elites that gather together in such activist ideological organizations as the conservative Federalist Society and the liberal American Constitution Society.
(From the 2022 paper Exploring the interpretation and application of procedural rules: the problem of implicit and institutional racial bias, by Edward A. Purcell Jr., p. 2586.)
For a deeper dive into the views of Devins and Baum, see their 2019 book The Company They Keep. The Purcell paper gives a number of interesting examples of institutional bias at the Supreme Court level.
To take one of many examples, consider the case of Jessop v. City of Fresno, quoted here from the paper Qualified Immunity: A Legal Fiction That Has Outlived Utility (pp. 41–42):
[p]olice officers executing a search warrant in relation to alleged illegal gambling machines produced an inventory sheet stating that they had seized $50,000 from the suspects. But the officers had actually seized $151,380 in cash and another $125,000 in rare coins and simply pocketed the difference between what they seized and what they reported—effectively using a search warrant to steal more than $225,000. The Ninth Circuit granted immunity to the officers.
Refer to the paper for additional examples and background.
Incidentally, civil asset forfeiture by law enforcement, whereby police officers (or DEA agents, etc.) can seize property from citizens who have not been convicted of—or even charged with—any crime, has ballooned into a multi-billion-dollar activity over the last few decades. In 2014, for example, federal law enforcement "took more stuff from people than burglars did." Civil forfeiture is a byproduct of the war on drugs; it was very rarely used before the 1980s, and when it was, the funds went directly to the government's coffers, not to those of law enforcement. It was only after the passage of the Comprehensive Crime Control Act (CCCA) in 1984, under the Reagan administration, that law enforcement agencies were allowed to keep seized assets, which turned forfeiture into a huge financial incentive for them. Net annual deposits went from $27 million in 1985 to $2.8 billion in 2019, an increase of over 10,000%.
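A quick back-of-the-envelope check of that last figure, using only the two deposit totals quoted above (expressed in millions of dollars):

\[
\frac{2{,}800 - 27}{27} \times 100\% \approx 10{,}270\%,
\]

which is indeed an increase of over 10,000%.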
See this article for an excerpt from that book, and also this paper by the same author.
Both Sotomayor and Thomas have spoken against it.
Or sometimes actuarial judgments, a nod to their joint historical origins in the insurance industry.
That's the same ability that gives people an edge in tasks like driving: being able to read the body language and facial expressions of other drivers, pedestrians, and cyclists, for instance, turns out to be crucial behind the wheel, and it is the sort of thing that humans do effortlessly.
See, for instance, this 2013 paper, which describes how VRAG has stood up over time and reports an AUC of roughly 0.75. Incidentally, those who have spent time scrutinizing the COMPAS algorithm will no doubt have run across mentions of VRAG in materials such as the COMPAS Practitioner's Guide, which includes performance numbers for VRAG (the same AUC of about 0.75) and other actuarial methods in order to contextualize COMPAS's performance.
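For readers unfamiliar with the metric: an AUC (area under the ROC curve) of 0.75 can be read as saying that a randomly chosen individual who did reoffend receives a higher risk score than a randomly chosen individual who did not about three times out of four. The minimal Python sketch below illustrates that reading of the number; the scores and outcomes are invented purely for illustration and are not taken from VRAG, COMPAS, or any study cited here.

```python
def auc(scores, outcomes):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve:
    the fraction of (reoffender, non-reoffender) pairs in which the
    reoffender received the higher risk score, counting ties as 1/2."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]   # reoffended
    neg = [s for s, y in zip(scores, outcomes) if y == 0]   # did not reoffend
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical risk scores and observed outcomes (1 = reoffended, 0 = did not):
scores   = [0.9, 0.8, 0.7, 0.6, 0.5, 0.3, 0.25, 0.2]
outcomes = [1,   1,   0,   1,   0,   0,   1,    0]
print(auc(scores, outcomes))  # 0.75: a reoffender outranks a non-reoffender
                              # in 12 of the 16 possible pairings
```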
Humans still make credit decisions when very large amounts of money are involved.