The Prosecutor’s Fallacy: Thompson and Schumann’s 1987 Anatomy of the Transposed Conditional, and Why the Long-Term Equity Investor Must Ask Every Question in Both Directions

Afternoon Edition — Mental Models · Essay No. 7 · 7 June 2026 · Tallinn

1. The model

In 1987, Law and Human Behavior carried a paper by William C. Thompson and Edward L. Schumann of the University of California, Irvine, under a deliberately plain title: “Interpretation of Statistical Evidence in Criminal Trials: The Prosecutor’s Fallacy and the Defense Attorney’s Fallacy” (vol. 11, no. 3, pp. 167–187). The paper did something unusual in the history of applied probability. It took an arithmetic error that judges, juries, physicians and investors had been committing anonymously for centuries, gave it a name, and then measured, under controlled conditions, how often intelligent people commit it. The error is this: treating the probability of the evidence, assuming innocence, as if it were the probability of innocence, given the evidence.

The setting Thompson and Schumann used is worth keeping in view because every later example in this essay is a costume change on the same body. A crime is committed. A trace — a blood type, a fibre, a partial DNA profile — links the perpetrator to the defendant, and a forensic witness testifies that the matching characteristic occurs in, say, one person in a thousand. The prosecutor then argues: there is only a 0.1 per cent chance the defendant would match if he were innocent; therefore there is a 99.9 per cent chance he is guilty. That inference feels like algebra. It is not. It is the fallacy. The 0.1 per cent describes how often the evidence appears among the innocent. It says nothing, by itself, about how likely innocence is once the evidence has appeared — because that depends on how many innocent people were in the pool of possible matches to begin with. In a city of one million, a one-in-a-thousand trait belongs to roughly a thousand people. The match has narrowed the field from a million to a thousand; it has not narrowed it to one.

Thompson and Schumann named the mirror image too. The defense attorney’s fallacy accepts the thousand-person pool and concludes the evidence is therefore worthless — the defendant is just one among a thousand, a 0.1 per cent case, so the match should be ignored. This is equally wrong, in the opposite direction. The match multiplied the odds on guilt a thousandfold; evidence that shrinks the candidate pool from a million to a thousand is far from worthless. It is simply not a verdict.

The one-sentence form of the model: the probability of the evidence given the hypothesis is not the probability of the hypothesis given the evidence, and the gap between the two is set by the base rate. Statisticians have older names for the underlying slip — the confusion of the inverse, the fallacy of the transposed conditional — and the formal antidote has existed since Thomas Bayes’s posthumous essay of 1763, which this series treated in Essay No. 1. What Thompson and Schumann added in 1987 was a demonstration, on 217 experimental subjects, that the error is not an exotic failure of the untrained but the default behaviour of the educated mind.

2. The mechanism

Why are the two conditionals so different, and why does the mind insist on swapping them? The first question is answered by Bayes’ rule itself. The posterior probability of a hypothesis is the likelihood of the evidence under that hypothesis, weighted by the hypothesis’s prior probability, and normalised by how often the evidence occurs overall. Transposing the conditional quietly deletes the prior. It behaves as if guilt and innocence started the trial as equally likely, as if there were only two candidates in the city, not a million. Whenever the prior probability of the hypothesis is small (one offender in a metropolis, one accounting fraud among thousands of honest filings, one extreme compounder in the whole listed universe), the two conditionals diverge not by shades but by orders of magnitude. P(match | innocent) can be one in a thousand while P(innocent | match) stands near 99.9 per cent: both numbers true, in the same courtroom, at the same time.
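The rule described in words above can be checked in a few lines. This is a minimal sketch, assuming exactly one offender in a city of one million and a true offender who always matches; the function and variable names are illustrative, not drawn from any cited source.

```python
def posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """P(H | E) by Bayes' rule: likelihood times prior, divided by P(E)."""
    p_evidence = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    return p_e_given_h * prior / p_evidence


# Assumptions for illustration: one offender among a million residents,
# and the offender is certain to show the matching trait.
population = 1_000_000
prior_guilt = 1 / population
p_match_given_guilty = 1.0
p_match_given_innocent = 1 / 1000  # the forensic witness's figure

p_guilty_given_match = posterior(
    prior_guilt, p_match_given_guilty, p_match_given_innocent
)
p_innocent_given_match = 1.0 - p_guilty_given_match

print(f"P(match | innocent) = {p_match_given_innocent:.3%}")
print(f"P(innocent | match) = {p_innocent_given_match:.1%}")
```

Under these assumptions the two printed figures sit at opposite ends of the scale, which is the divergence the paragraph describes.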

The second question — why the mind swaps them — has a more behavioural answer. Language is symmetrical where probability is not: “the chance of a match if he is innocent” and “the chance he is innocent if he matches” differ by two words, and working memory does not reliably preserve the difference. Thompson and Schumann’s first experiment showed that the packaging of the same fact steers the direction of error: when the incidence statistic was phrased as a conditional probability, subjects’ mistakes ran in the prosecution’s favour; when the identical fact was phrased as a percentage of the population, errors ran toward the defence. Gerd Gigerenzer’s research programme, summarised in Reckoning with Risk (2002), pushed the diagnosis further: single-event probabilities — “a 0.1 per cent chance” — are a historically recent, cognitively unnatural format. Re-expressed as natural frequencies — “of 10,000 people, ten would match; one is the offender” — the same problem becomes almost transparent, because the denominator that the fallacy hides is physically present in the sentence.

Seen this way, both fallacies are the same bookkeeping failure: a Bayesian inference needs two numbers held simultaneously — the prior odds, and the likelihood ratio (how much more often the evidence appears under guilt than under innocence). The prosecutor’s fallacy throws away the prior and keeps the likelihood ratio; the defense attorney’s fallacy throws away the likelihood ratio and keeps the prior. Each side of the courtroom discards precisely the number that hurts its case, which is why Thompson and Schumann’s title reads like an indictment of advocacy itself.
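The two-number bookkeeping can be written in odds form: posterior odds equal prior odds times the likelihood ratio. The sketch below is illustrative only. It models the prosecutor's fallacy as replacing the prior with even odds and the defense attorney's fallacy as dropping the likelihood ratio, following the description above; the city of one million and the certain match for the offender are assumptions.

```python
def odds_to_prob(odds: float) -> float:
    """Convert odds in favour of a hypothesis into a probability."""
    return odds / (1.0 + odds)


# Assumptions: one offender, 999,999 innocents; the offender always matches.
prior_odds = 1 / 999_999
likelihood_ratio = 1.0 / 0.001  # P(match | guilty) / P(match | innocent)

# Bayesian answer: keep both numbers.
bayes_answer = odds_to_prob(prior_odds * likelihood_ratio)

# Prosecutor's fallacy: prior discarded (treated as even odds), ratio kept.
prosecutor_answer = odds_to_prob(1.0 * likelihood_ratio)

# Defense attorney's fallacy: prior kept, ratio discarded (treated as 1).
defense_answer = odds_to_prob(prior_odds * 1.0)

print(f"Bayes:      {bayes_answer:.4%}")
print(f"Prosecutor: {prosecutor_answer:.4%}")
print(f"Defense:    {defense_answer:.4%}")
```

Each fallacy is one multiplication short of the full calculation, which is why the three answers differ by factors of about a thousand.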

Natural-frequency tree showing 10,000 people, of whom 10 are offenders and 9,990 are innocent; forensic test flags all 10 offenders and about 100 innocents, so a flagged person is guilty roughly 1 time in 11
Figure 1. The denominator the fallacy hides. A one-in-a-hundred trait, traced through a population of 10,000 in natural frequencies: the same match that is “99 per cent accurate” in transposed language leaves the flagged individual roughly 91 per cent likely to be innocent.
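The counts in Figure 1 can be rebuilt directly as natural frequencies. A minimal sketch, assuming the figure's population of 10,000, ten offenders, a 1 per cent false-positive rate and a test that flags every offender; the variable names are illustrative.

```python
# Natural-frequency version of Figure 1 (all inputs are the figure's assumptions).
population = 10_000
offenders = 10
innocents = population - offenders

sensitivity = 1.0           # every offender is flagged
false_positive_rate = 0.01  # the "one-in-a-hundred" trait among innocents

flagged_offenders = offenders * sensitivity
flagged_innocents = innocents * false_positive_rate
flagged_total = flagged_offenders + flagged_innocents

share_guilty = flagged_offenders / flagged_total
share_innocent = flagged_innocents / flagged_total

print(f"Flagged: {flagged_total:.1f} people, of whom {flagged_offenders:.0f} are offenders")
print(f"Share of flagged who are innocent: {share_innocent:.1%}")
```

Working in counts keeps the denominator, the full group of flagged people, on the page at every step.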

3. The empirical record

The experimental record begins with the naming paper itself. Thompson and Schumann put written versions of both fallacious arguments — a prosecutor’s transposition and a defence attorney’s dismissal — in front of their subjects. A majority failed to detect the flaw in at least one of the two arguments, and made probability judgments consistent with the fallacious reasoning they had endorsed. Across both experiments, subjects also systematically underused the statistic relative to the Bayesian benchmark — the now-familiar finding that people neither ignore base rates nor honour them, but improvise.

The most damning evidence, however, comes from medicine, where the same inversion decides diagnoses rather than verdicts. In 1978, Casscells, Schoenberger and Graboys (New England Journal of Medicine, vol. 299, pp. 999–1001) put a one-line problem to sixty physicians, house officers and students at Harvard teaching hospitals: if a disease has a prevalence of 1 in 1,000 and the test for it has a false-positive rate of 5 per cent, what is the chance that a person with a positive result actually has the disease? The Bayesian answer, assuming a sensitive test, is about 2 per cent. Eleven of the sixty got it. The most common answer — given by twenty-seven of sixty — was 95 per cent: the test’s accuracy, transposed wholesale into a posterior probability, the prosecutor’s fallacy in a white coat. Thirty-five years later, Manrai, Bhatia, Strymish, Kohane and Jain repeated the exercise verbatim on sixty-one Boston-area physicians and trainees (JAMA Internal Medicine, vol. 174, 2014, pp. 991–993). Fourteen of sixty-one answered correctly. The modal answer was still 95 per cent. The median answer, 66 per cent, overstated the true probability by a factor of thirty-three. A generation of statistical education had moved the needle from 18 per cent correct to 23 per cent — a difference within the noise.
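The screening problem itself takes only a few lines of arithmetic. The sketch below assumes, as the text does, a fully sensitive test; the cohort size of 100,000 is an arbitrary choice that keeps the counts whole, and the function name is hypothetical.

```python
def positive_predictive_value(
    prevalence: float, false_positive_rate: float, sensitivity: float = 1.0
) -> float:
    """Share of positive results that are true positives, via explicit counts."""
    cohort = 100_000  # arbitrary illustrative cohort size
    sick = cohort * prevalence
    healthy = cohort - sick
    true_positives = sick * sensitivity
    false_positives = healthy * false_positive_rate
    return true_positives / (true_positives + false_positives)


# The one-line problem: prevalence 1 in 1,000, false-positive rate 5 per cent.
ppv = positive_predictive_value(prevalence=1 / 1000, false_positive_rate=0.05)
print(f"P(disease | positive test) = {ppv:.1%}")
```

In a cohort of this size, about 100 true positives are outnumbered by nearly 5,000 false ones, which is why the answer lands near 2 per cent rather than 95.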

The encouraging half of the record belongs to format. Gigerenzer and Hoffrage (Psychological Review, vol. 102, 1995, pp. 684–704) showed that when the identical diagnostic problems are restated as natural frequencies, that is, as counts of people in a concrete population rather than single-event percentages, the rate of correct Bayesian reasoning multiplies severalfold in their student subjects; later work in the same programme reported the effect in physicians too. The fallacy, in other words, is not a fixed defect of the species; it is substantially an artefact of notation. Institutions eventually responded in kind: the English Court of Appeal condemned the transposition by name in R v. Deen (1994), and the Royal Statistical Society took the rare step, in October 2001, of issuing a public statement against the “misuse of statistics in the courts”, prompted by a particular case to which we turn below.

Grouped bar chart comparing the 1978 Casscells study and the 2014 Manrai replication: share answering the diagnostic problem correctly (18 and 23 per cent) versus share answering 95 per cent (45 and 44 per cent), against a true answer of about 2 per cent
Figure 2. Thirty-six years, no learning. Harvard physicians in 1978 (Casscells et al., NEJM) and Boston physicians in 2014 (Manrai et al., JAMA Internal Medicine) given the same one-line screening problem: the modal answer both times was 95 per cent; the true answer was about 2 per cent.

4. Two historical episodes

People v. Collins, California, 1968. An elderly woman was robbed in a Los Angeles alley in June 1964; witnesses recalled a blonde woman with a ponytail fleeing to a yellow car driven by a Black man with a beard and moustache. The prosecutor called a mathematics instructor, assigned each characteristic an invented frequency (one couple in a thousand interracial, one car in ten yellow, one man in four moustached, and so on), multiplied them as if independent, and announced that the chance of any couple matching the description was one in twelve million. The jury was invited to hear that number as the chance the Collinses were innocent. The California Supreme Court reversed the conviction in 1968 (68 Cal. 2d 319), and its judgment remains the canonical dissection of the model. The court made three cuts. The component frequencies had no empirical basis. The multiplication assumed independence where none existed: bearded men are disproportionately moustached. And, most instructively, the court appended a calculation showing that even taking the one-in-twelve-million figure at face value, in a region with millions of couples there was a probability of just over 40 per cent that at least one other couple matched the description. The prosecutor’s number, even if true, measured the rarity of the evidence, not the probability of guilt.
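The appendix argument can be reproduced approximately in a few lines. This sketch assumes a simple binomial model in which each of n couples independently matches the description with probability p. Setting n to twelve million is an illustrative assumption about the size of the pool, and the function name is hypothetical.

```python
import math


def prob_second_match(p: float, n: int) -> float:
    """P(at least two couples match | at least one couple matches).

    Binomial model: n couples, each matching independently with probability p.
    """
    log_miss = math.log1p(-p)  # log(1 - p), accurate for tiny p
    p_none = math.exp(n * log_miss)
    p_exactly_one = n * p * math.exp((n - 1) * log_miss)
    return (1.0 - p_none - p_exactly_one) / (1.0 - p_none)


p_match = 1 / 12_000_000   # the prosecutor's figure, taken at face value
couples = 12_000_000       # assumed pool size, for illustration only

duplicate_chance = prob_second_match(p_match, couples)
print(f"Chance that another couple also matches: {duplicate_chance:.1%}")
```

The question being asked is conditional: given that one matching couple exists, how likely is a second? A rare description is still not a unique one.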

R v. Sally Clark, England, 1999–2003. Sally Clark, a solicitor, lost her first son at eleven weeks in 1996 and her second at eight weeks in 1998. At her murder trial, the eminent paediatrician Sir Roy Meadow testified that in a family like hers — affluent, non-smoking, mother over twenty-six — the probability of a single cot death was about 1 in 8,543, taken from the CESDI study of infant deaths; squaring it, he put the chance of two natural cot deaths at “1 in 73 million.” Both moves were wrong, and each compounded the other. The squaring assumed the deaths were independent, when shared genetic and environmental factors make a second cot death in an affected family far likelier than the first — Ray Hill’s later analysis in Paediatric and Perinatal Epidemiology (vol. 18, 2004, pp. 320–326) put the dependence at several multiples of the baseline. And the resulting figure was received, in court and in headlines, as the probability that Clark was innocent — a transposition of exactly the Collins type, since two-infant-death families are themselves vanishingly rare, and murder is rarer among them still. The Royal Statistical Society’s October 2001 statement said plainly that the 1-in-73-million figure had “no statistical basis.” Clark’s conviction was quashed on her second appeal in January 2003, after she had served more than three years. The same template — a small likelihood paraded as a posterior — ran through the Dutch case of the nurse Lucia de Berk, convicted in 2003 and exonerated in 2010.
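The squaring step, and its sensitivity to the independence assumption, can be seen in a short sketch. The dependence multiplier of 10 below is purely hypothetical, chosen to show the direction of the effect; it is not Hill's estimate, and the function name is illustrative.

```python
def one_in_n_for_two_deaths(one_in_n_single: int, dependence_multiplier: float = 1.0) -> float:
    """Return N such that P(two natural deaths) = 1 / N.

    dependence_multiplier = 1.0 reproduces the independence assumption;
    values above 1 make the second death likelier once a first has occurred.
    """
    return one_in_n_single * one_in_n_single / dependence_multiplier


single = 8543  # the CESDI-derived figure quoted at trial

# Independence assumed: 8543 squared = 72,982,849, the "1 in 73 million".
naive = one_in_n_for_two_deaths(single)

# Hypothetical dependence (multiplier of 10 is an assumption, not a finding).
dependent = one_in_n_for_two_deaths(single, dependence_multiplier=10.0)

print(f"Independent deaths:  1 in {naive:,.0f}")
print(f"With dependence x10: 1 in {dependent:,.0f}")
```

Even the corrected figure is only a likelihood, the chance of the evidence given innocence. Turning it into a probability of innocence still requires comparing it with the rarity of the alternative explanation.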

Timeline from 1968 to 2010: People v. Collins reversal 1968, Casscells physician study 1978, Thompson and Schumann name the fallacy 1987, R v. Deen 1994, Sally Clark convicted 1999, Royal Statistical Society statement 2001, Clark conviction quashed 2003, Lucia de Berk exonerated 2010
Figure 3. Four decades of the same error. The transposed conditional travels from a Los Angeles robbery trial to the Harvard wards to the English Court of Appeal — named in 1987, condemned by the Royal Statistical Society in 2001, and still convicting in 2003.

5. Application to long-term equity investing

An equity investor’s reading diet is a procession of conditionals, almost all of them quoted in the seductive direction. “Ninety per cent of frauds showed aggressive accrual growth.” “Nearly every hundred-fold compounder was founder-led.” “Every post-war recession was preceded by an inverted yield curve.” Each statement reports the probability of the evidence given the hypothesis — P(trait | fraud), P(trait | compounder), P(inversion | recession) — and each quietly invites the reader to act on the transpose: to treat the trait as a verdict of fraud, the founder as a guarantee of compounding, the inversion as a certainty of recession. Three operating disciplines translate the model into practice.

First: write every conditional in both directions before acting on it. For any piece of evidence E offered in support of a thesis H, the investment journal should record two numbers, not one: how often E appears when H is true, and how often E appears when H is false. The ratio of the two — the likelihood ratio — is the entire evidentiary content of E. Messod Beneish’s M-Score (“The Detection of Earnings Manipulation,” Financial Analysts Journal, vol. 55, no. 5, 1999) is the disciplined version of the fraud conditional: it catches a large share of manipulators precisely because it was built on both distributions. But since deliberate manipulation is rare in any given year, the arithmetic of small priors guarantees that most firms the screen flags are not frauds — the screen is a reason to investigate, never a verdict. The investor who forgets this commits the prosecutor’s fallacy against a company; the investor who therefore dismisses screens entirely commits the defense attorney’s.
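The two journal entries and their consequence can be sketched with counts. Every number below is hypothetical and chosen for round arithmetic; none is taken from Beneish's paper, and the function name is illustrative.

```python
def screen_outcome(n_firms: int, base_rate: float, hit_rate: float, false_alarm_rate: float):
    """Likelihood ratio of a flag, and the make-up of the flagged group."""
    manipulators = n_firms * base_rate
    honest = n_firms - manipulators
    flagged_manipulators = manipulators * hit_rate   # E appears when H is true
    flagged_honest = honest * false_alarm_rate       # E appears when H is false
    likelihood_ratio = hit_rate / false_alarm_rate
    share_guilty = flagged_manipulators / (flagged_manipulators + flagged_honest)
    return likelihood_ratio, flagged_manipulators, flagged_honest, share_guilty


# Hypothetical universe: 5,000 firms, 2% manipulators, a screen that flags
# 80% of manipulators and 10% of honest firms.
lr, hits, false_alarms, share_guilty = screen_outcome(5_000, 0.02, 0.80, 0.10)

print(f"Likelihood ratio of a flag: {lr:.1f}")
print(f"Flagged: {hits:.0f} manipulators, {false_alarms:.0f} honest firms")
print(f"Share of flagged firms that are manipulators: {share_guilty:.1%}")
```

Under these assumptions a flag raises the odds of manipulation eightfold, yet most flagged firms are honest. The likelihood ratio says the evidence counts, and the low share says it is not a verdict.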

Second: rebuild any quoted statistic as natural frequencies over an explicit reference class. This is Gigerenzer’s remedy, imported to securities analysis. “Most extreme compounders were founder-led” becomes: of the several thousand listed companies that were founder-led at the start of a decade, how many compounded a hundredfold? The moment the denominator is forced into view, the transposition dies of exposure, usually taking the thesis’s glamour with it. Hendrik Bessembinder’s finding that a small minority of listed firms accounts for the bulk of net equity wealth creation, treated earlier in this letter, makes the point structural: the base rate of extreme outcomes in equities is so low that any trait-based conditional, transposed, will overstate the odds by orders of magnitude.

Third: count the opportunities before being impressed by a coincidence. The Collins appendix is a permanent piece of investing equipment. Evidence that looks one-in-twelve-million stops being remarkable once twelve million couples have had the chance to produce it. A backtested signal significant at one-in-a-thousand, discovered after a thousand variations were scanned, is the expected harvest of noise. A fund manager’s ten-year winning streak must be judged against the thousands of managers who each had a chance to produce one by luck. Before treating any pattern as evidence, ask the Collins question: how many draws were taken? The answer routinely converts the extraordinary into the merely arithmetical.
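The Collins question reduces to two small formulas: the chance of at least one hit in many independent draws, and the expected number of hits. The sketch below assumes independent draws. The pool of 3,000 managers and the coin-flip chance of beating the market in any year are hypothetical inputs, and the function names are illustrative.

```python
def prob_at_least_one(p: float, draws: int) -> float:
    """Chance that at least one of `draws` independent tries succeeds."""
    return 1.0 - (1.0 - p) ** draws


def expected_hits(p: float, draws: int) -> float:
    """Expected number of successes across `draws` independent tries."""
    return p * draws


# Backtest: a one-in-a-thousand result after scanning a thousand variations.
backtest_chance = prob_at_least_one(1 / 1000, 1000)

# Manager streak: assume a 50% chance of a winning year by luck alone,
# and a hypothetical pool of 3,000 managers.
p_ten_year_streak = 0.5 ** 10
lucky_streaks = expected_hits(p_ten_year_streak, 3_000)

print(f"Chance noise alone yields a 'significant' backtest: {backtest_chance:.1%}")
print(f"Expected ten-year streaks from luck alone: {lucky_streaks:.1f}")
```

Under these assumptions a spurious backtest is more likely than not, and a handful of lucky ten-year streaks is the expected outcome rather than a surprise.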

6. How the long-term equity tradition has used it

Charlie Munger placed elementary probability at the foundation of the latticework in his 1994 talk at the University of Southern California’s Marshall School, “A Lesson on Elementary, Worldly Wisdom” (reprinted in Poor Charlie’s Almanack, 2005). The first mental models he names there are the “elementary math” of Fermat and Pascal — permutations, combinations, and decision-tree thinking — and his standard for mastery is not acquaintance but reflex: the discipline has to be used routinely, daily, or it will be elbowed aside by whatever is vivid. Berkshire Hathaway’s insurance operations are that reflex institutionalised. An underwriter is, in effect, paid to keep the conditional pointing in the right direction — to price the probability of loss given the risk class, against the constant temptation of vivid recent evidence argued in the transposed direction.

Howard Marks devoted an entire memo to the probabilistic frame — “You Bet!” (Oaktree Capital, January 2020) — drawing on poker and backgammon to argue that investing is a game of incomplete information and luck in which the quality of a decision cannot be read off the quality of its outcome. That distinction is the conditional-probability discipline in working clothes: judging P(outcome | decision) across the distribution of futures, rather than reasoning backwards from a single observed outcome to the merit of the decision — the transposition that turns one lucky result into “skill” and one unlucky one into “error.” Michael Mauboussin’s The Success Equation (2012) systematised the same point for security selection: the more luck contributes to short-run outcomes, the more an investor must score the process — the forward conditional — because outcomes alone, transposed into judgments of skill, are statistically slanderous in both directions.

7. Key takeaways

  • The model in one line. The probability of the evidence given the hypothesis is not the probability of the hypothesis given the evidence; Thompson and Schumann (1987) named both directions of the confusion — the prosecutor’s fallacy and the defense attorney’s fallacy — and showed a majority of subjects swallow at least one.
  • The gap is the base rate. Transposing a conditional silently assumes the hypothesis started fifty-fifty. When the prior is small — one offender in a city, one fraud in a market, one extreme compounder in a listed universe — the two conditionals diverge by orders of magnitude.
  • Format is a tool. Single-event percentages hide the denominator; natural frequencies over a concrete population (Gigerenzer & Hoffrage, 1995) expose it, and with it the fallacy. Rebuild every quoted statistic as counts in a reference class.
  • Evidence is a likelihood ratio, not a verdict. A screen that catches most frauds still mislabels most of the firms it flags, because frauds are rare. Record how often the evidence appears when the thesis is false before acting on how well it fits when true.
  • Count the opportunities. The Collins court’s appendix generalises: rare-looking coincidences are manufactured wholesale by wide searches — across backtests, managers, and patterns — so ask how many draws were taken before being impressed.

— Manish Goel, FCA / NorthPath Advisory OÜ / Tallinn, Estonia
