The Law of Small Numbers: Tversky and Kahneman’s 1971 Warning That Small Samples Exaggerate, and Why the Long-Term Equity Investor Should Distrust Short Track Records

Afternoon Edition · Mental Models · No. 6

The model: a counterfeit of a true law

Since the year 1150 the English crown has audited its own coinage in a ceremony called the Trial of the Pyx. Sample coins are drawn from the mint, locked in a wooden box — the pyx — and weighed in the aggregate against a standard. Because no minting process is perfectly precise, the contract between the crown and the mint allowed a tolerance: the sampled coins, taken together, could deviate from the standard by a small fixed fraction of their total weight. The drafters made an assumption so natural that nobody questioned it for six centuries. They assumed that the allowable wobble in a hundred coins is simply a hundred times the allowable wobble in one. Variation, in other words, was priced as if it grew in proportion to the size of the sample.

It does not. In 1730, in a work called Miscellanea Analytica, the French-born mathematician Abraham de Moivre showed that the variability of an average shrinks not with the size of the sample but with its square root. A hundred coins do not wobble a hundred times more than one coin: their total weight wobbles only ten times more, and their average ten times less. The Trial of the Pyx had therefore spent nearly six hundred years granting the mint a tolerance far looser than honest minting required, slack that anyone running the mint could, in principle, harvest by casting coins systematically near the permissive edge. The statistician Howard Wainer, who tells the story in “The Most Dangerous Equation” (American Scientist, vol. 95, 2007), nominates de Moivre’s little formula (the standard error equals the standard deviation divided by the square root of the sample size) as the most dangerous equation in the world, dangerous not to those who know it but to everyone who does not.
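De Moivre's result can be written out for the Pyx case. This is a sketch that assumes each coin's weight varies independently with the same standard deviation $\sigma$; the symbols are illustrative and are not taken from Wainer's article.

```latex
\begin{align*}
\operatorname{SD}\Bigl(\sum_{i=1}^{n} X_i\Bigr) &= \sigma\sqrt{n},
&
\operatorname{SD}\bigl(\bar{X}\bigr) &= \frac{\sigma}{\sqrt{n}}, \\[4pt]
n = 100:\quad \operatorname{SD}(\text{total}) &= 10\,\sigma \quad (\text{not } 100\,\sigma),
&
\operatorname{SD}(\text{average}) &= \frac{\sigma}{10}.
\end{align*}
```

On these assumptions, a tolerance that grows in proportion to $n$ is ten times wider than the honest noise in a box of a hundred coins.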

The psychology of why we do not know it, why centuries of educated people keep mispricing the reliability of small samples, was pinned down in 1971, in a six-page paper that helped found the modern study of judgement. In “Belief in the Law of Small Numbers” (Psychological Bulletin, vol. 76, no. 2, pp. 105–110), Amos Tversky and Daniel Kahneman circulated questionnaires at two professional meetings, those of the Mathematical Psychology Group and the American Psychological Association, and asked working scientists routine questions about sampling, replication and experimental design. The respondents were people who used statistics for a living. Their answers were systematically, confidently wrong. They treated small samples as if they carried the evidential authority of large ones, expected marginal results from twenty subjects to replicate as a matter of course, and recommended follow-up experiments so small that the odds of confirming a true effect were close to a coin flip.

Tversky and Kahneman gave the error a deliberately satirical name. The law of large numbers — the genuine theorem, proved by Jacob Bernoulli in 1713 and examined in an earlier letter in this series — guarantees that the average of a very large sample sits close to the truth. The bias they documented is the intuition that this guarantee extends downward: that the law of large numbers applies to small numbers as well. People behave, they wrote, as if every sample, however modest, must resemble the population that produced it.

The one-sentence form of the model: a small sample is mostly noise wearing the costume of signal, and intuition cannot see the costume.

The mechanism: why the square root defeats intuition

Three gears drive the error. The first is representativeness — the mind’s habit of expecting any sample to be a faithful miniature of its parent population, locally as well as globally. A fair coin is expected to look fair in every stretch of ten flips, so a run of six heads feels like a distortion that the next flips must correct. That specific corollary — the gambler’s fallacy, and its mirror image the hot hand — deserves and will receive its own essay; what matters here is the parent error, the miniature-expectation itself.

The second gear is the arithmetic intuition that the Pyx drafters got wrong: variation feels as if it should scale in proportion to sample size, when de Moivre showed it scales with the square root. The practical consequence is brutal and almost universally unpriced. To halve the noise in an estimate you must quadruple the observations. To cut it by a factor of ten you need a hundred times the data. Evidence therefore arrives far more slowly than the calendar suggests, and a sample that is twice as big is nowhere near twice as informative. Intuition, billing variance at a linear rate, treats three good quarters as half the proof of six, when in truth six quarters still carry about 71 per cent of the noise in three (one over the square root of two).
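The square-root arithmetic is short enough to check directly. The sketch below assumes independent observations of equal variance; the function names are invented for illustration.

```python
import math

def relative_noise(data_multiple: float) -> float:
    """Standard error after multiplying the sample size, relative to before."""
    return 1.0 / math.sqrt(data_multiple)

def data_multiple_needed(noise_cut: float) -> float:
    """How many times more observations are needed to divide the noise by noise_cut."""
    return noise_cut ** 2

for multiple in (2, 4, 100):
    share = relative_noise(multiple)
    print(f"{multiple:>3}x the data -> noise falls to {share:.2f} of its old level")
```

Doubling the data, as in going from three quarters to six, leaves roughly seven-tenths of the original noise in place.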

The third gear is asymmetric vividness. A small sample is usually a story — one fund, one school, one clinical trial, one quarter — and stories arrive with detail, faces and apparent causes. The correction term, the standard error, arrives as an abstraction with a square-root sign on it. In any contest between a vivid streak and an invisible denominator, the streak wins. This is also where the present model parts company with the one examined in the previous letter in this series. Selection bias corrupts who ends up in the sample; the law of small numbers operates even when the sample is perfectly drawn. The data can be clean, unbiased and honestly gathered, and still — if there is too little of it — be incapable of saying what the observer hears it say.

Figure 1. The same fair coin, sampled two ways. Ten samples of five flips produce averages scattered from 20 to 80 per cent; ten samples of fifty hug the true value. Nothing about the coin changed, only the square root in de Moivre’s denominator. After Tversky and Kahneman (1971) and Wainer (2007). [Chart: small multiples comparing ten samples of five coin flips against ten samples of fifty; the small-sample averages scatter widely while the large-sample averages cluster around the true 50 per cent line.]
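The coin experiment is easy to rerun. The sketch below is a simulation under assumed settings, not the data behind the figure: it draws 2,000 samples of each size rather than ten, so that the spread can be measured instead of eyeballed, and the seed is arbitrary.

```python
import random
import statistics

def sample_averages(flips_per_sample: int, n_samples: int, rng: random.Random) -> list[float]:
    """Share of heads in each of n_samples independent samples of a fair coin."""
    return [
        sum(rng.random() < 0.5 for _ in range(flips_per_sample)) / flips_per_sample
        for _ in range(n_samples)
    ]

rng = random.Random(1971)  # arbitrary seed so the run is repeatable
small = sample_averages(5, 2000, rng)
large = sample_averages(50, 2000, rng)

spread_small = statistics.pstdev(small)
spread_large = statistics.pstdev(large)
ratio = spread_small / spread_large

print(f"spread of 5-flip averages:  {spread_small:.3f}")
print(f"spread of 50-flip averages: {spread_large:.3f}")
print(f"ratio: {ratio:.2f} (theory: sqrt(10), about 3.16)")
```

Ten times the flips buys only about a threefold reduction in scatter, which is the square root at work.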

The empirical record: trained minds, untrained intuitions

The 1971 paper’s most unsettling finding was its choice of subjects. These were not undergraduates misjudging dice; they were professional researchers with formal statistical training, and their intuitions failed in exactly the way the theory of representativeness predicts. Asked to design a follow-up to a marginally significant experiment, they proposed sample sizes that gave roughly even odds of confirming a real effect — and then expressed near-certainty that the confirmation would arrive. Their expectations had been formed by the miniature-expectation, not by the power calculation they knew how to perform. Jacob Cohen had already measured the institutional consequence: surveying a year of published psychology in 1962, he found the median study had less than a fifty-fifty chance of detecting the medium-sized effects it was hunting. The discipline’s eventual replication crisis was, in large part, the law of small numbers operating at industrial scale.
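The coin-flip odds of a small follow-up can be sketched with a normal approximation. The inputs below (an original z-score of 2.23 from twenty subjects, a follow-up of ten, a one-tailed 5 per cent test) are illustrative assumptions, and the calculation takes the generous view that the true effect is exactly as large as the one first observed.

```python
from statistics import NormalDist

def replication_power(z_original: float, n_original: int, n_followup: int,
                      alpha_one_tailed: float = 0.05) -> float:
    """Chance the follow-up is significant if the true effect equals the observed one."""
    std_normal = NormalDist()
    # The expected z-score scales with the square root of the sample size.
    z_expected = z_original * (n_followup / n_original) ** 0.5
    z_critical = std_normal.inv_cdf(1 - alpha_one_tailed)
    return 1 - std_normal.cdf(z_critical - z_expected)

half_size = replication_power(2.23, 20, 10)
same_size = replication_power(2.23, 20, 20)
print(f"follow-up with 10 subjects: {half_size:.2f}")
print(f"follow-up with 20 subjects: {same_size:.2f}")
```

Even a same-sized replication falls well short of certainty under these assumptions, and the true effect is often smaller than the one that first cleared the significance bar.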

The pattern recurs wherever averages are ranked. Wainer and Harris Zwerling examined the 1,662 Pennsylvania public schools reporting fifth-grade test scores and found that among the 50 top-scoring schools — the top 3 per cent — the smallest 3 per cent of schools appeared at four times their expected rate. The natural inference is that small schools are better. But the bottom of the table destroyed it: nine of the 50 worst-scoring schools — 18 per cent — were also among the 50 smallest. Small schools were not better or worse; they were noisier, over-represented at both extremes precisely as de Moivre requires (Phi Delta Kappan, 2006). The same triangle appears in American county health data, where the counties with the lowest kidney-cancer rates are overwhelmingly small, rural ones — and so are the counties with the highest rates (Wainer, 2007; Kahneman, Thinking, Fast and Slow, 2011, ch. 10).
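The two-tailed pattern can be reproduced with schools that are identical by construction. In the sketch below every pupil is drawn from the same score distribution; the enrolment range, the standardised score scale and the seed are hypothetical, and only the counts of 1,662 schools and 50-school tails come from the text.

```python
import random
import statistics

rng = random.Random(2006)        # arbitrary seed for repeatability
N_SCHOOLS, TAIL = 1662, 50       # counts taken from the text
TRUE_MEAN, PUPIL_SD = 0.0, 1.0   # hypothetical standardised scores

# Every school samples pupils from the same distribution; only enrolment differs.
sizes = [rng.randint(10, 400) for _ in range(N_SCHOOLS)]  # hypothetical range
averages = [
    statistics.fmean(rng.gauss(TRUE_MEAN, PUPIL_SD) for _ in range(n))
    for n in sizes
]

ids = range(N_SCHOOLS)
smallest = set(sorted(ids, key=lambda i: sizes[i])[:TAIL])
by_score = sorted(ids, key=lambda i: averages[i])
bottom, top = set(by_score[:TAIL]), set(by_score[-TAIL:])

top_hits = len(smallest & top)
bottom_hits = len(smallest & bottom)
chance = TAIL * TAIL / N_SCHOOLS  # overlap expected if size were irrelevant

print(f"smallest schools in the top {TAIL}:    {top_hits}")
print(f"smallest schools in the bottom {TAIL}: {bottom_hits}")
print(f"expected by chance: {chance:.1f}")
```

No school in this toy world is better or worse than any other, yet the smallest ones crowd both ends of the ranking.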

Investing supplies the cleanest field evidence, because fund league tables are ranked averages refreshed every year. S&P Dow Jones Indices’ U.S. Persistence Scorecard (year-end 2024 edition) tracked the large-cap equity funds that sat in the top performance quartile as of 2020: by the end of 2024, none remained in the top quartile. Not few — none. Joel Greenblatt reported the same physics from the other direction in The Big Secret for the Small Investor (2011): among the managers whose full-decade record for 2000–2010 placed them in the top quartile, 97 per cent spent at least three of those ten years in the bottom half of the field, 79 per cent spent at least three years in the bottom quartile, and 47 per cent spent at least three years in the bottom decile. Read those two findings together and the lesson is symmetrical: short windows manufacture both false saints and false sinners, and the investor who hires after three good years and fires after three bad ones is trading on the noise term of de Moivre’s equation.
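A rough calculation shows why even a genuinely skilled manager should be expected to post several poor years in a decade. The sketch assumes independent, normally distributed annual excess returns, summarises skill by a hypothetical information ratio (mean annual excess return divided by its standard deviation), and uses lagging the benchmark as a stand-in for the bottom half of the field, which is not the same measure Greenblatt used.

```python
from math import comb
from statistics import NormalDist

def prob_lagging_years(info_ratio: float, years: int = 10, at_least: int = 3) -> float:
    """P(at least `at_least` benchmark-lagging years) under the stated assumptions."""
    p_bad_year = NormalDist().cdf(-info_ratio)  # chance one year's excess return is negative
    p_fewer = sum(
        comb(years, k) * p_bad_year ** k * (1 - p_bad_year) ** (years - k)
        for k in range(at_least)
    )
    return 1 - p_fewer

for ir in (0.0, 0.5, 1.0):  # hypothetical skill levels
    print(f"information ratio {ir:.1f}: {prob_lagging_years(ir):.0%}")
```

At an information ratio of 0.5, which would count as strong skill on most definitions, three or more lagging years in ten remain more likely than not.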

Two purchases of noise

The first episode cost about $1.7 billion. In the late 1990s the Bill and Melinda Gates Foundation, joined by Annenberg, Carnegie, the Pew Charitable Trusts and the U.S. Department of Education’s Smaller Learning Communities Program, began funding the conversion of large American high schools into small ones. By 2001 the foundation’s education grants supporting the push totalled roughly $1.7 billion, and districts in New York, Los Angeles, Chicago and Seattle were splintering big schools into clusters of small ones. The evidence behind the policy was a ranking: among the highest-performing schools in the country, small schools were conspicuously over-represented. The claim was true. It was also exactly what sampling theory predicts for institutions whose test-score averages rest on few pupils — and nobody had checked the bottom of the table, where small schools were over-represented too. By October 2005 the Seattle Times was reporting that the foundation had moved away from breaking up large schools; researchers presenting at Brookings later concluded the conversions had done students measurable harm. A philanthropy of unusual rigour had bought the upper tail of a variance distribution, at scale (Wainer, 2007; Schneider, Wyse and Keesler, 2007).

The second episode wore a halo for fifteen years. From 1991 through 2005, the Legg Mason Value Trust under Bill Miller beat the S&P 500 after fees every single calendar year — the most celebrated streak in modern fund management, and the foundation of a widespread conviction that skill of a rare order had been identified. The ensemble arithmetic said something quieter. Leonard Mlodinow worked it through in The Drunkard’s Walk (2008): given the thousands of funds operating across the post-war decades, the probability that some fund, somewhere, would assemble a fifteen-year calendar streak by luck alone comes out near three chances in four. A streak was always likely; the only question was which name it would attach to. Miller himself was more candid than his admirers, calling the run “an accident of the calendar” — had the year ended in different months, he noted, it would not exist, and he put the luck share at perhaps 95 per cent. The sequel behaved like noise reverting. The fund trailed the index in 2006 and 2007, then lost roughly 55 per cent in 2008 against about 37 for the S&P 500; assets fell from $16.5 billion to $4.3 billion by December 2008 through losses and redemptions, the fund lagged in five of Miller’s final six years, and he stepped away from it in 2012 (Mauboussin, The Success Equation, 2012). None of this proves Miller had no skill — his later record suggests he had a great deal. It proves that fifteen annual data points, drawn from an ensemble of thousands, could never have measured it.
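The ensemble arithmetic can be sketched as follows. This is not a reconstruction of Mlodinow's calculation: it assumes every fund is an independent fair coin each calendar year, that all funds run for the same forty years, and the ensemble sizes are hypothetical.

```python
def prob_streak(n_years: int, streak: int, p_win: float = 0.5) -> float:
    """Chance that one fund posts at least one run of `streak` straight wins in n_years."""
    # alive[j] = probability of no qualifying run so far and a current run of j wins
    alive = [1.0] + [0.0] * (streak - 1)
    achieved = 0.0
    for _ in range(n_years):
        nxt = [0.0] * streak
        nxt[0] = sum(alive) * (1 - p_win)      # a losing year resets the run
        for j in range(streak - 1):
            nxt[j + 1] = alive[j] * p_win      # a winning year extends it
        achieved += alive[streak - 1] * p_win  # the run reaches the target length
        alive = nxt
    return achieved

def prob_any_fund(n_funds: int, n_years: int, streak: int) -> float:
    """Chance that at least one of n_funds independent coin-flipping funds manages it."""
    return 1 - (1 - prob_streak(n_years, streak)) ** n_funds

for funds in (1_000, 3_000, 6_000):  # hypothetical ensemble sizes
    print(f"{funds:>5} funds, 40 years: {prob_any_fund(funds, 40, 15):.0%}")
```

Under these made-up inputs the answer swings from about one in three to more than nine in ten as the assumed fund count rises from 1,000 to 6,000, so any headline probability depends heavily on how the ensemble is counted.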

Figure 2. Two purchases of noise. Small schools dominated the top of the league tables because they dominated both tails; a fifteen-year streak was more likely than not to appear somewhere in the ensemble before it was anyone’s achievement. Sources: Wainer (2007); Wainer and Zwerling (2006); Mlodinow (2008); Mauboussin (2012). [Two-panel diptych: the Gates Foundation small-schools initiative of 2001, with its 1.7 billion dollar commitment and four-times over-representation statistic, beside the Legg Mason Value Trust fifteen-year streak of 1991 to 2005 and its aftermath.]

Application: three operating disciplines

01 · Count the tries, not just the wins

A record is a sample drawn twice — once from the volatility of the strategy, and once from the population of competitors who could have posted it. Before crediting any streak, league-table position or “best performing” label, reconstruct the ensemble: how many funds, managers, products or model portfolios were running, such that this one could surface? A three-year number one among ten thousand candidates is closer to a lottery announcement than a diagnosis. The same discipline applies inside the portfolio. A thesis confirmed by two quarters of delivery has not been confirmed; a concept store that works in four locations out of four tells you almost nothing about location forty — the binomial spread on four trials is enormous. The question that defuses the vividness of every small sample is the same: out of how many tries?
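The spread on four trials can be put into numbers. The sketch below uses the exact (Clopper-Pearson) interval, which has a closed form when every trial succeeds; the 95 per cent confidence level is an assumed convention and the function name is invented.

```python
def lower_bound_all_successes(n_trials: int, confidence: float = 0.95) -> float:
    """Exact two-sided lower limit on the success rate when every trial succeeded."""
    alpha = 1 - confidence
    # With n successes in n trials the limit solves p ** n = alpha / 2.
    return (alpha / 2) ** (1 / n_trials)

for n in (4, 40):
    bound = lower_bound_all_successes(n)
    print(f"{n} of {n} successes: true rate could be as low as {bound:.0%}")
```

Four stores out of four is consistent with a format that fails more often than it works; forty out of forty is not.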

02 · Judge over spans long enough for signal to beat noise

Since noise shrinks only with the square root of observations, the practical defence is to pre-commit to evaluation windows long enough for signal to dominate — and to refuse interim verdicts, in both directions. Warren Buffett institutionalised this in the Berkshire Hathaway letters as early as 1983, insisting on a minimum five-year test of performance against the index and asking, memorably, why the time a planet takes to circle the sun should have anything to do with the time business decisions take to pay off. Nick Sleep built the same rule into the Nomad Investment Partnership’s letters, repeatedly asking his partners to judge the partnership over rolling five-year spans rather than the annual figures convention demanded. The point of the pre-commitment is not patience as a virtue; it is sample size as a requirement. Five years is not always enough — but one year is arithmetically never enough.
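How long is long enough can be estimated, at least roughly. The sketch assumes independent annual excess returns and a hypothetical information ratio (mean excess return over its standard deviation), and asks how many years pass before the expected outperformance clears a conventional two-sided 95 per cent bar.

```python
from statistics import NormalDist

def years_to_distinguish(info_ratio: float, confidence: float = 0.95) -> float:
    """Years of data before expected outperformance clears a two-sided significance bar."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # The t-statistic of average excess return grows as info_ratio * sqrt(years).
    return (z / info_ratio) ** 2

for ir in (1.0, 0.5, 0.25):  # hypothetical skill levels
    print(f"information ratio {ir:.2f}: about {years_to_distinguish(ir):.0f} years")
```

Halving the assumed skill quadruples the wait, so a five-year window is adequate only for an edge that is large relative to its volatility.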

03 · Mind de Moivre’s square root in every comparison of averages

Whenever two averages are compared — fund A against fund B, this quarter’s margin against last year’s, the small division’s growth against the large one’s — ask what sample sizes sit underneath, because ranked averages without their denominators are a machine for promoting the noisiest. Expect small units to colonise both ends of any league table: the best-performing branch, the worst sales region, the most accurate forecaster of the year. Where the data cannot be enlarged, shift the weight of the judgement onto evidence that does not shrink with √n — the economics of the business, incentives, competitive structure, the logic of the accounting — and treat the short statistical series as decoration. The investor cannot make small samples informative; the only available decision is to stop paying signal prices for noise.
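A comparison of two averages can be restated in units of its own standard error before anyone acts on it. The two funds below are hypothetical, as are their returns and volatilities; the calculation assumes independent annual observations.

```python
import math

def gap_in_standard_errors(mean_a: float, sd_a: float, n_a: int,
                           mean_b: float, sd_b: float, n_b: int) -> float:
    """Gap between two sample averages, measured in units of its own standard error."""
    se_gap = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / se_gap

# Hypothetical funds: A leads B by 4 points a year, both with 15-point annual volatility.
three_years = gap_in_standard_errors(12.0, 15.0, 3, 8.0, 15.0, 3)
thirty_years = gap_in_standard_errors(12.0, 15.0, 30, 8.0, 15.0, 30)
print(f"after 3 years:  {three_years:.2f} standard errors")
print(f"after 30 years: {thirty_years:.2f} standard errors")
```

A lead that looks decisive in a league table can sit well inside one standard error of zero once the sample sizes are put back under it.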

Figure 3. The counter-machine. Three operating disciplines that convert de Moivre’s equation and Tversky and Kahneman’s 1971 result into daily investment practice. [Three discipline cards: count the tries, not just the wins; judge over five-year spans; mind de Moivre’s square root in every comparison of averages.]

How the long-term tradition has used it

The long-horizon equity tradition discovered this model in practice long before it adopted the vocabulary. Buffett’s 1983 letter made the five-year test a governing principle of Berkshire’s own self-assessment, and his letters return repeatedly to the theme that single-year results — including Berkshire’s good ones — are unworthy of either celebration or apology. The deeper architecture of the holding company reflects the same idea: an owner who refuses to be measured annually is an owner who has stopped paying for noise.

Howard Marks has spent four decades writing the same warning from the asset-management side. His January 2014 memo “Getting Lucky,” and the chapter on luck in The Most Important Thing (2011), build on a lesson he dates to his student days: in a world this random, the quality of a decision cannot be judged from a single outcome, because bad decisions routinely pay off and good ones routinely fail in any short window. Oaktree’s client letters therefore treat short-term relative performance — favourable or not — as close to meaningless, an attitude that is simply the law of small numbers internalised as firm culture.

Joel Greenblatt turned the model into a hiring screen. The statistics he assembled in 2011 — the 97 per cent of decade-winning managers who spent years in the bottom half — were aimed at allocators who fire at the bottom of the noise cycle and hire at the top, guaranteeing themselves the worst of both. And Nick Sleep’s Nomad letters pushed the discipline to its logical end: if short records cannot distinguish skill from chance, then the manager’s job is to design measurement — five-year judgement spans, destination over speed — so that neither he nor his partners are ever tempted to act on a sample too small to mean anything.

Key takeaways

The law of large numbers is true; the bias is believing it works fast. Tversky and Kahneman’s 1971 finding is that even trained minds treat small samples as miniatures of the population — the law of large numbers applied to small numbers.
Noise shrinks with the square root, not the count. De Moivre’s 1730 equation means halving the noise requires quadrupling the data; three good quarters, four successful stores and a two-year track record are arithmetically close to silence.
Ranked averages promote the noisiest. Small schools, small counties and small funds dominate both tails of every league table; the top of any short-window ranking is mostly variance on display.
Ensembles manufacture streaks. Across thousands of funds, a fifteen-year winning streak somewhere was always more likely than not (about three chances in four on Mlodinow’s arithmetic). Short-window rankings fail to persist for the same reason: zero top-quartile large-cap funds of 2020 stayed top-quartile to 2024, and 97 per cent of a decade’s best managers spent at least three years in the bottom half.
The defences are procedural. Count the tries behind every winner; pre-commit to five-year judgement spans as Buffett (1983) and Sleep’s Nomad letters did; and when the sample cannot grow, weight the evidence that does not shrink with √n — the economics, not the streak.

— Manish Goel, FCA / NorthPath Advisory OÜ / Tallinn, Estonia

Important.
All content on this site and in this email is journalism and education for a general audience. Nothing here constitutes investment advice or a recommendation in respect of any specific financial instrument, nor an offer or solicitation to buy or sell any security. Readers should consult an authorised financial adviser regulated in their own jurisdiction before making any investment decision.
