Selection Bias: Joseph Berkson’s 1946 Paradox and Why the Long-Term Equity Investor Must Ask Who Is Missing From the Data

Afternoon Edition · Mental Models · No. 5

In 1943 the United States Army Air Forces brought a practical problem to the Statistical Research Group, a band of mathematicians working quietly out of an apartment near Columbia University. Bombers were being lost over Europe, and the service wanted to know where to add armour. The planes that returned were covered in data: clusters of bullet holes across the wings, the fuselage and the tail, and conspicuously few around the engines. The intuitive reading was to reinforce the parts that were most often hit. Abraham Wald, a refugee mathematician with no aeronautical training, gave the opposite instruction. Armour the engines, the place where the returning planes showed almost no damage at all.

Wald’s reasoning was not about metallurgy. It was about the sample. The data described only the aircraft that came back. Planes hit in the engine were under-represented in the hangar for the simple reason that they were at the bottom of the Channel. The absence of engine damage among survivors was not evidence that engines were rarely struck; it was evidence that engine strikes were fatal. The most important information in the problem was missing by construction, and the obvious conclusion was exactly backwards. This is selection bias, and it is among the most expensive errors an investor can make precisely because it hides inside data that looks complete.

The model: a 1946 warning about who gets counted

The phenomenon was given its first rigorous statistical treatment three years after Wald’s memo. In 1946 the physician-statistician Joseph Berkson published “Limitations of the Application of Fourfold Table Analysis to Hospital Data” in the Biometrics Bulletin (vol. 2, no. 3, pp. 47–53). Berkson was studying associations between diseases using hospital records, and he proved something disquieting: two conditions that are entirely unrelated in the general population can appear strongly negatively correlated once you restrict your attention to people who have been admitted to hospital. The correlation is not real. It is manufactured by the act of selection itself.

The one-sentence form of the model is this: when you analyse a sample that has been filtered by some outcome, the filter can invent relationships that do not exist and erase relationships that do. Selection bias is therefore not a flaw in the data-gathering of careless people. It is a structural property of any dataset whose membership depends on the very thing you are trying to study. Survivorship bias — the better-known cousin that Wald confronted — is one special case, the case where the filter is survival. Berkson’s contribution was to show that the problem is general: any selection rule, not only survival, can distort the picture, and the distortion can run in either direction.

Berkson’s original illustration used three medically unrelated conditions — diabetes, inflammation of the gallbladder, and refractive eye problems. In the wider population they have nothing to do with one another. Among hospitalised patients they become entangled, because being admitted at all is more likely if you have any one of them, and the pool of patients is defined by that admission. Condition on the gate, and you contaminate everything you measure downstream.

The mechanism: conditioning on a collider

Modern statistics has a precise vocabulary for why this happens. A “collider” is a variable that is jointly caused by two others. Hospital admission is a collider: it is caused both by having disease A and by having disease B. When you select your sample on a collider (by looking only at the admitted, only at the survivors, only at the funds still open, only at the founders who got the meeting) you open a non-causal path between the two causes, and an association appears where there was none. The 2018 paper by Marcus Munafò and colleagues, “Collider scope: when selection bias can substantially influence observed associations” (International Journal of Epidemiology, vol. 47, no. 1, pp. 226–235), demonstrates with simulations how readily a modest selection effect produces sizeable phantom correlations, and how invisible the mechanism is to anyone looking only at the selected data.
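The collider mechanism can be checked by exact arithmetic. The sketch below is a minimal illustration, not Berkson's own table: the prevalences and admission probabilities are hypothetical, and it assumes each cause of admission acts independently. It compares how common disease A is among patients with and without disease B, first in the population and then inside the ward.

```python
from fractions import Fraction as F

# Hypothetical inputs, chosen only for illustration (not Berkson's figures).
P_A = F(1, 10)       # population prevalence of disease A
P_B = F(1, 10)       # population prevalence of disease B, independent of A
H_A = F(3, 10)       # chance that A on its own leads to admission
H_B = F(3, 10)       # chance that B on its own leads to admission
H_OTHER = F(1, 50)   # chance of admission for any unrelated reason


def p_admit(a: int, b: int) -> F:
    """Probability of admission when each cause acts independently."""
    stay_out = (1 - H_OTHER) * (1 - H_A) ** a * (1 - H_B) ** b
    return 1 - stay_out


def p_a_given_b_admitted(b: int) -> F:
    """P(A present | B == b, admitted), by exact enumeration."""
    p_b = P_B if b else 1 - P_B
    with_a = P_A * p_b * p_admit(1, b)
    without_a = (1 - P_A) * p_b * p_admit(0, b)
    return with_a / (with_a + without_a)


in_ward_with_b = p_a_given_b_admitted(1)
in_ward_without_b = p_a_given_b_admitted(0)

# In the population A and B are independent, so P(A | B) = P(A | not B) = P_A.
print("Population: P(A | B) = P(A | not B) =", float(P_A))
print("Ward:       P(A | B)     =", round(float(in_ward_with_b), 3))
print("Ward:       P(A | not B) =", round(float(in_ward_without_b), 3))
```

Under these assumed numbers, a ward patient without B is far more likely to have A than a ward patient with B, simply because something had to bring them through the door. Nothing about the diseases has changed; only the gate has been applied.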

The reason the error is so durable is that the selected sample is usually internally consistent and superficially rich. Wald’s engineers had thousands of genuine data points. Berkson’s hospital tables were real. The numbers were not wrong; the population they described was simply not the population anyone thought they were studying. Selection bias does not announce itself through noise or small samples. It corrupts the inference quietly, while leaving the spreadsheet looking authoritative. That is what makes it a mental model worth internalising rather than a statistical footnote — the discipline it demands is a habit of asking, before any calculation, “what determined whether a data point made it into this table?”

The trap is easiest to feel with a market example. Suppose an analyst studies only companies that have already grown large and admired, and within that elite finds that two genuine virtues — a conservative balance sheet and rapid growth — appear to trade off against one another. The temptation is to conclude that prudence and growth are natural enemies. But in the full population of companies the two may be entirely unrelated; the apparent trade-off can be produced by the entry gate alone, because a firm usually needs at least one of the two qualities to become large enough to enter the sample at all. Having neither keeps a company small and invisible; so among the giants the two virtues look mutually exclusive when they are nothing of the kind. The relationship is an artefact of admission, exactly as in Berkson’s ward — and an analyst who never looks outside the elite has no way to detect it.
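The company example can be reproduced with a short simulation. This is a sketch under stated assumptions: the two virtues are drawn as independent standard-normal scores, the universe size and the entry threshold are hypothetical, and "large and admired" is modelled crudely as clearing the threshold on at least one virtue.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)
N_FIRMS = 50_000   # hypothetical universe size
GATE = 1.0         # hypothetical threshold, in standard deviations

# Two independent "virtues" per firm: by construction they are unrelated.
prudence = [rng.gauss(0, 1) for _ in range(N_FIRMS)]
growth = [rng.gauss(0, 1) for _ in range(N_FIRMS)]

# A firm enters the elite sample if it is strong on at least one virtue.
elite = [(p, g) for p, g in zip(prudence, growth) if p > GATE or g > GATE]
elite_p, elite_g = zip(*elite)

corr_all = pearson(prudence, growth)
corr_elite = pearson(elite_p, elite_g)
print(f"all firms:  n={N_FIRMS}, corr={corr_all:+.3f}")
print(f"elite only: n={len(elite)}, corr={corr_elite:+.3f}")
```

The full universe should show a correlation near zero, while the elite subsample shows a clearly negative one. The size of the phantom trade-off depends on the assumed gate, so the figure itself carries no empirical weight; the sign is the point.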

Figure 1. Berkson’s paradox: two traits that are independent in the full population appear negatively correlated once the sample is restricted to those that clear a selection gate. After Berkson, Biometrics Bulletin (1946).

The empirical record: from labour economics to fund tables

The economics profession was forced to confront selection bias in its own house. When economists tried to estimate the return to education or the determinants of wages, they could only observe wages for people who chose to work. The decision to work is not random — it depends on the very characteristics, many of them unobserved, that also drive wages. James Heckman formalised the cure in “Sample Selection Bias as a Specification Error” (Econometrica, vol. 47, no. 1, January 1979, pp. 153–161), showing that a non-random sample is a species of omitted-variable bias and supplying a two-step correction now taught in every graduate programme. The work was important enough that Heckman shared the 2000 Nobel Memorial Prize in Economic Sciences “for his development of theory and methods for analysing selective samples.” The point is worth dwelling on: an entire discipline discovered that its headline estimates had been quietly wrong because of how its samples were assembled.
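A small simulation shows the bias Heckman described; it does not implement his two-step correction. All parameters are hypothetical: a true return to schooling of 1.0, and unobserved traits that raise both the wage and the likelihood of working, with an assumed correlation of 0.7 between them.

```python
import random


def ols_slope(xs, ys):
    """Slope of a one-variable least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(1)
N_PEOPLE = 100_000
TRUE_RETURN = 1.0   # hypothetical true wage gain per unit of schooling
RHO = 0.7           # hypothetical correlation between the two unobservables

schooling, offered_wage, works = [], [], []
for _ in range(N_PEOPLE):
    x = rng.gauss(0, 1)
    e = rng.gauss(0, 1)   # unobserved wage shock
    # Unobserved taste for work, correlated with the wage shock.
    u = RHO * e + (1 - RHO ** 2) ** 0.5 * rng.gauss(0, 1)
    schooling.append(x)
    offered_wage.append(TRUE_RETURN * x + e)
    works.append(x + u > 0)   # the wage is observed only if this is True

obs_x = [x for x, w in zip(schooling, works) if w]
obs_y = [y for y, w in zip(offered_wage, works) if w]

slope_everyone = ols_slope(schooling, offered_wage)   # not available in practice
slope_workers = ols_slope(obs_x, obs_y)               # what the economist sees
print(f"true return: {TRUE_RETURN:.2f}")
print(f"everyone:    {slope_everyone:.2f}")
print(f"workers only:{slope_workers:.2f}")
```

In this set-up the workers-only estimate falls well below the true value, because low-schooling people appear in the sample only when their unobserved wage shock is unusually favourable. The direction and size of the bias depend on the assumed parameters.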

Capital markets supply the cleanest financial illustration. Studies of mutual-fund performance once drew their samples from the funds that still existed at the end of the period. But funds close, and the ones that close are disproportionately the poor performers. Brown, Goetzmann, Ibbotson and Ross laid out the problem in “Survivorship Bias in Performance Studies” (Review of Financial Studies, vol. 5, no. 4, 1992, pp. 553–580), showing that a sample truncated by survival can manufacture the very appearance of performance persistence that researchers thought they were discovering. Burton Malkiel, examining equity funds over 1971–1991 (Journal of Finance, vol. 50, no. 2, 1995), estimated that survivorship inflated apparent returns by roughly 1.5 percentage points a year; Brown and Goetzmann’s own estimates ranged from roughly 0.2 to 0.8 points depending on weighting. A point and a half a year, compounded across a saver’s lifetime, is the difference between a comfortable retirement and a thin one, and it was an artefact of which funds were allowed into the table.
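The compounding claim is easy to put into numbers. In the sketch below only the 1.5-point gap comes from the text; the 40-year horizon and the 7 per cent reported return are hypothetical choices made for illustration.

```python
YEARS = 40         # hypothetical saving horizon
REPORTED = 0.070   # hypothetical survivor-only annual return
BIAS = 0.015       # the roughly 1.5-point inflation cited above


def terminal_wealth(annual_return: float, years: int) -> float:
    """Value of 1 unit invested once and compounded annually."""
    return (1 + annual_return) ** years


flattered = terminal_wealth(REPORTED, YEARS)
realistic = terminal_wealth(REPORTED - BIAS, YEARS)
shortfall = 1 - realistic / flattered

print(f"survivor-only table: {flattered:.1f}x")
print(f"with the dead funds: {realistic:.1f}x")
print(f"shortfall:           {shortfall:.0%}")
```

On these assumptions the saver who trusted the survivor-only figure ends up with somewhat over 40 per cent less than planned. A different horizon or starting return changes the exact figure but not the shape of the result.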

The subtlety in the Brown, Goetzmann, Ibbotson and Ross result is worth isolating, because it is counter-intuitive. Truncating a sample by survival does not merely lift the average; it can manufacture the appearance of skill that repeats. Funds that take large risks and lose tend to vanish, while those that take the same risks and happen to win remain to be measured again in the next period. The survivors therefore look as though good performance predicts good performance, when what is really on display is volatility passed through a survival filter. An investor screening for “consistent outperformers” in such data is chasing a statistical shadow.
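The persistence artefact can be sketched in a few lines. This is a stylised illustration of the mechanism described above, not a replication of the paper: every fund has zero skill, funds differ only in an assumed volatility, and the hypothetical survival rule closes any fund that has a losing period.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(2)
N_FUNDS = 50_000
CUTOFF = 0.0   # hypothetical rule: a fund closes after any losing period

funds = []
for _ in range(N_FUNDS):
    sigma = rng.uniform(0.05, 0.40)   # risk appetite differs by fund
    r1 = rng.gauss(0.0, sigma)        # zero skill: expected return is 0
    r2 = rng.gauss(0.0, sigma)
    funds.append((r1, r2))

survivors = [(r1, r2) for r1, r2 in funds if r1 > CUTOFF and r2 > CUTOFF]

persistence_all = pearson(*zip(*funds))
persistence_survivors = pearson(*zip(*survivors))
mean_all = sum(r1 for r1, _ in funds) / len(funds)
mean_survivors = sum(r1 for r1, _ in survivors) / len(survivors)

print(f"all funds: persistence={persistence_all:+.3f}, mean={mean_all:+.3f}")
print(f"survivors: persistence={persistence_survivors:+.3f}, "
      f"mean={mean_survivors:+.3f}")
```

Across all funds, first-period and second-period returns are uncorrelated. Among survivors they are positively correlated, because the high-volatility funds that remain had to win big in both periods. The apparent repeat winners are the lucky risk-takers, exactly the shadow the paragraph above warns against.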

Nor is the bias confined to funds; the securities themselves are filtered. A market index is a managed list from which failures are quietly removed and into which yesterday’s winners are added, so a back-history of “the index” flatters the experience of anyone who actually lived through the deletions. Long-run studies that rely on databases which silently drop delisted shares inherit the same defect: the worst outcomes — the frauds, the bankruptcies, the zeros — are precisely the observations most likely to disappear from the file. The careful reader of any long-term return series asks first whether the dead are still in it.

Figure 2. Illustrative. Restricting a performance sample to the funds that survived inflates the reported return relative to the actual return that includes closed funds; documented survivorship bias has run roughly 0.2 to 1.5 percentage points a year (Brown, Goetzmann, Ibbotson and Ross, 1992; Malkiel, 1995).

The quantitative era has produced a more aggressive version of the same disease. A researcher who tries hundreds of strategy variations and reports only the best one has, in effect, selected on outcome. David Bailey, Jonathan Borwein, Marcos López de Prado and Qiji Zhu, in “Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance” (Notices of the American Mathematical Society, vol. 61, no. 5, 2014, pp. 458–471), prove that a dazzling backtested track record is trivially easy to produce simply by trying enough configurations, and that without knowing how many were tried, an investor cannot tell skill from selection. The backtest is the modern hospital ward: a sample defined by the outcome it is supposed to predict.
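The effect is simple to demonstrate with strategies that are pure noise. The sketch below is an illustration in the spirit of the paper, not its method: each "strategy" is one hypothetical year of zero-mean daily returns, and the researcher reports only the best in-sample Sharpe ratio out of N tries. The final column is the rough square-root-of-twice-log-N benchmark for the best of N independent noise draws.

```python
import math
import random
import statistics

rng = random.Random(3)
DAYS = 252   # hypothetical: one year of daily returns per backtest


def annualised_sharpe(returns):
    """Annualised Sharpe ratio of a daily series (risk-free rate taken as 0)."""
    return statistics.mean(returns) / statistics.stdev(returns) * DAYS ** 0.5


def random_strategy():
    """Daily returns with zero true edge: pure noise."""
    return [rng.gauss(0.0, 0.01) for _ in range(DAYS)]


def best_backtest(n_trials):
    """Best in-sample Sharpe among n_trials noise strategies, plus the Sharpe
    on a fresh year (a new noise draw, since true skill is zero)."""
    best = max(annualised_sharpe(random_strategy()) for _ in range(n_trials))
    fresh = annualised_sharpe(random_strategy())
    return best, fresh


results = {n: best_backtest(n) for n in (1, 10, 100, 1000)}
for n, (best, fresh) in results.items():
    benchmark = math.sqrt(2 * math.log(n))
    print(f"tried {n:>4}: best in-sample {best:5.2f}, "
          f"out-of-sample {fresh:5.2f}, benchmark {benchmark:4.2f}")
```

The reported in-sample figure tends to climb as more configurations are tried, while the out-of-sample figure stays scattered around zero. Without the number of tries, the reader of the report cannot tell the two regimes apart.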

Two episodes where the model explains the outcome

The first episode is Wald’s, and its lesson outlived the war. The Statistical Research Group’s insight — that the planes you can examine are a biased sample of the planes you sent — has been retold for decades, most memorably by the mathematician Jordan Ellenberg in How Not to Be Wrong (2014). The reason it endures is that the structure recurs everywhere capital is deployed: the deals that closed, the tenants who renewed, the customers who did not churn, the loans that did not default in the sample window. Each is a returning bomber. The information you most need is carried by the cases that never made it into your file.

The second episode is the slow rewriting of the mutual-fund performance literature through the 1990s. For years the industry and parts of academia could point to evidence that the average actively managed fund roughly kept pace with the market and that winners repeated. As researchers reconstructed the dead funds and put them back into the sample, the apparent persistence weakened and the average record deteriorated. The funds had not changed; the sample had been completed. The episode is a permanent caution that the most flattering statistic in finance — “the funds in this category returned X” — is only as honest as its treatment of the funds that no longer exist to be averaged. What unites the bomber and the fund table is that in both the decisive evidence was invisible by design, and in both the correct move was to reason about the missing cases rather than the present ones.

Application: three operating disciplines for the long-term investor

Markets are an unusually pure selection machine, which is why the model earns its keep here more than almost anywhere else. Prices, indices and league tables are continuously rewritten to drop the failures: a collapsing share is delisted and leaves the index, a disappointing fund is merged away, a strategy that stops working is retired and forgotten. The data an investor inherits has already been groomed by survival before it is ever opened. Three disciplines turn the abstract warning into daily practice.

The first discipline is to reconstruct the denominator before trusting any rate. Whenever a track record, a screen or a category average is presented, the first question is not “how good is the number?” but “what is the full population this was drawn from, and what happened to the members that are missing?” A list of “the best-performing companies of the last decade” is selected on the outcome and tells you almost nothing about what to do at the start of a decade. A survivorship-free history that includes the delisted, the acquired-in-distress and the wound-up is harder to find and far more informative. When the graveyard is unavailable, the right adjustment to any flattering figure is downward. In practice this means favouring data sources and indices that retain their delisted and dead constituents, and mentally adding the casualties back whenever only a pre-cleaned list is on offer.

The second discipline is to treat every backtest and case study as guilty until the trial count is disclosed. Following Bailey, Borwein, López de Prado and Zhu, the operative question for any quantitative claim is “how many variations were examined before this one was shown to me?” A strategy presented with no account of the search behind it should be discounted heavily, and out-of-sample or genuinely forward evidence should be weighted far above in-sample fit. The same scepticism applies to the narrative case study: a book that studies ten great companies and extracts their common habits has conditioned on success, and its “lessons” may be present in equal measure among the failures it never examined. A workable rule of thumb is to grant an undocumented backtest no evidential weight at all until its author can say how wide the search behind it was.
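Once the trial count is known, a rough hurdle can be computed. The function below is a back-of-envelope sketch, not the paper's procedure: it assumes independent trials with zero true skill and approximately normal returns, and uses the approximation that the best of N such backtests, each spanning a given number of years, has an annualised Sharpe ratio of about the square root of 2 ln(N) divided by the number of years. The name noise_hurdle is hypothetical.

```python
import math


def noise_hurdle(n_trials: int, years: float) -> float:
    """Approximate best annualised Sharpe expected from n_trials independent
    zero-skill strategies, each backtested over `years` years."""
    if n_trials < 1 or years <= 0:
        raise ValueError("need n_trials >= 1 and years > 0")
    return math.sqrt(2 * math.log(n_trials) / years)


for trials in (1, 10, 100, 1000):
    print(f"{trials:>4} trials over 5 years: "
          f"hurdle is about {noise_hurdle(trials, 5):.2f}")
```

A reported Sharpe ratio that does not clear the hurdle for its own trial count is indistinguishable from the best of a batch of noise. Real searches are rarely independent, so the figure is a coarse screen rather than a test.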

The third discipline is to ask who is missing from the room. The success stories an investor hears — at conferences, in founder interviews, in the marketing of funds that were quietly incubated until they had a record worth showing — are a sample drawn by the selection rule “did it work?” The base rate lives among the cases that were never written up. Before extrapolating from a vivid winner, the long-term investor deliberately seeks the silent population: the comparable ventures that closed, the strategies abandoned, the managers who shut up shop. The discipline is uncomfortable because it requires manufacturing the absent data through imagination and reference classes rather than reading it off a screen. Reference-class forecasting — listing the comparable cases before judging the one in front of you — is the concrete tool, and it is worth remembering that fund sponsors exploit the opposite habit, launching many small funds and promoting only the few that happen to compile a record worth showing.

How the long-term equity tradition has used it

The most elegant engagement with selection bias in investing is Warren Buffett’s 1984 address at Columbia Business School, delivered to mark the fiftieth anniversary of Graham and Dodd’s Security Analysis and published as “The Superinvestors of Graham-and-Doddsville” in the school’s Hermes magazine that autumn. Buffett anticipated the efficient-market objection that a handful of successful value investors prove nothing — that if enough people flip coins, some will produce long winning streaks by chance, exactly as a national coin-flipping contest among orangutans would eventually crown a few champions. His answer is a masterclass in thinking about selection. If the winning orangutans turned out to come disproportionately from one zoo, you would suspect something about that zoo. His superinvestors were not selected after the fact from the universe of all money managers; they were identifiable in advance as disciples of a common intellectual approach, and they prospered in different securities and different decades. The shared causal origin, specified before the results were known, is what distinguishes skill from a story drawn out of a selected tail.

The discipline has a contemporary steward in Michael Mauboussin, the investment strategist and long-time Columbia faculty member, whose The Success Equation: Untangling Skill and Luck in Business, Sports, and Investing (Harvard Business Review Press, 2012) is built around the problem of reading outcomes correctly. His “paradox of skill” — that as competitors become more uniformly skilled, luck explains more of the spread in their results — is a direct consequence of how performance samples are generated, and his recurring instruction is to study the full distribution and the relevant base rate rather than the celebrated extreme. Between Buffett’s orangutans and Mauboussin’s distributions sits a single habit of mind: never reason from a sample without first interrogating the rule that produced it. It is no accident that both men teach the same corrective — widen the frame until the missing cases come back into view.

Figure 3. The best backtested Sharpe ratio rises with the number of strategy configurations tried even when true skill is zero, while the true out-of-sample Sharpe stays near zero; the expected maximum grows like the square root of twice the natural logarithm of N. After Bailey, Borwein, López de Prado and Zhu (2014).

Key takeaways

The filter is the danger, not the data. Selection bias arises whenever membership in a sample depends on the outcome being studied; the numbers can be perfectly accurate and the inference still backwards, as Wald’s bombers and Berkson’s hospital tables both show.
Selection can invent correlations and hide them. Conditioning on a collider — survival, admission, a fund still trading, a deal that closed — opens phantom relationships and erases real ones, in either direction.
The effect is large and measurable. Survivorship alone has inflated reported equity-fund returns by something like 1.5 percentage points a year (Malkiel, 1995); backtest overfitting can fabricate an impressive record from noise (Bailey et al., 2014).
The cure is procedural. Reconstruct the denominator, demand the trial count behind any backtest, and seek the silent population of failures before extrapolating from a winner.
Specify the hypothesis before the sample. Buffett’s defence of value investing works because the cohort was defined by a shared method in advance — the antidote to selection is a prior, not a louder result.

— Manish Goel, FCA / NorthPath Advisory OÜ / Tallinn, Estonia

Important.
All content on this site and in this email is journalism and education for a general audience. Nothing here constitutes investment advice or a recommendation in respect of any specific financial instrument, nor an offer or solicitation to buy or sell any security. Readers should consult an authorised financial adviser regulated in their own jurisdiction before making any investment decision.
