Calibration: Tetlock, Mellers and the Good Judgment Project, and Why the Long-Term Equity Investor Should Keep Score of His Own Forecasts

Behavioural Finance · The NorthPath Letter · Afternoon Edition

There is a number that separates the forecaster worth listening to from the one who merely sounds worth listening to, and it is not the number most people would guess. It is not the boldness of the prediction, nor the confidence with which it is delivered, nor the eminence of the person delivering it. It is the quiet correspondence between the probabilities a person attaches to events and the rate at which those events actually occur. When a meteorologist says there is a seventy per cent chance of rain, and it rains on roughly seventy per cent of the days she says so, she is calibrated. When a market commentator declares himself “certain” a downturn is imminent, and turns out wrong about as often as he is right, he is not. The body of research that measures this correspondence is among the most quietly devastating in all of social science, and the long-term equity investor — who lives by forecasts of the future, whether or not he admits to making them — has more to learn from it than almost anyone.

The bias: confidence is not accuracy

Calibration is a precise idea. A person is well-calibrated if, across all the occasions on which they say an outcome is ninety per cent likely, that outcome happens about ninety per cent of the time; and across all the occasions on which they say sixty per cent, it happens about sixty per cent of the time. Calibration is not the same as being right. A forecaster who says “seventy per cent” before every flip of a biased coin that lands heads seventy per cent of the time is perfectly calibrated, yet says nothing about which particular flips will come up heads. What calibration measures is honesty of a special kind: whether the confidence you express matches the rate at which you turn out to be right, so that a listener can take your numbers at face value.

The tools for measuring it are old. In 1950 the meteorologist Glenn Brier proposed what is now called the Brier score — the average squared distance between a stated probability and what actually happened, where zero is perfect and larger numbers are worse. It was designed to grade weather forecasts, and it does something subtle: it rewards a forecaster both for being calibrated and for being decisive, and it punishes false confidence harder than honest uncertainty. By 1982, when Sarah Lichtenstein, Baruch Fischhoff and Lawrence Phillips surveyed the field in their chapter “Calibration of Probabilities: The State of the Art to 1980,” in Kahneman, Slovic and Tversky’s landmark volume Judgment under Uncertainty, the verdict was already clear and unflattering: across most domains and most people, subjective probabilities are systematically overconfident. People assign too much certainty to too many things.
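The arithmetic of the Brier score is small enough to sketch in a few lines. What follows is a minimal illustration, not a reference implementation: the function name `brier_score` and the three forecasters' records are invented for the example, and outcomes are coded as 1 (it happened) or 0 (it did not).

```python
def brier_score(forecasts, outcomes):
    """Mean squared gap between stated probabilities and 0/1 outcomes (0 is perfect)."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need two equally long, non-empty lists")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical record: four events, three of which occurred.
outcomes = [1, 1, 1, 0]

overconfident = [0.99, 0.99, 0.99, 0.99]  # "near certain" every time
honest = [0.75, 0.75, 0.75, 0.75]         # matches the true hit rate
hedger = [0.50, 0.50, 0.50, 0.50]         # never commits

for name, record in [("overconfident", overconfident),
                     ("honest", honest),
                     ("hedger", hedger)]:
    print(f"{name:>13}: {brier_score(record, outcomes):.4f}")
```

On these made-up numbers the honest forecaster scores 0.1875, the overconfident one about 0.2451 and the perpetual hedger 0.25. The single miss costs the overconfident forecaster 0.9801 on that event against 0.5625 for the honest one, which is the sense in which the rule punishes false confidence harder than honest uncertainty.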

The most vivid demonstration belongs to Marc Alpert and Howard Raiffa, whose probability-training experiments are reported in the same volume. They asked subjects to give ranges — a low and a high — wide enough that the true value of some unknown quantity would fall inside them ninety-eight times out of a hundred. If people were honest about their ignorance, only two per cent of the true values should have fallen outside those ranges. In practice the miss rate was vastly higher, frequently above forty per cent. People’s ninety-eight-per-cent intervals were, in reality, closer to sixty-per-cent intervals. Raiffa’s exasperated instruction to his subjects has become a small classic of the literature: “For heaven’s sake, spread those extreme fractiles! Be honest with yourselves! Admit what you don’t know!” The bias has a name in the modern literature — overconfidence, or more precisely overprecision — but its operational signature is always the same: stated confidence runs ahead of demonstrated accuracy.
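The interval test is easy to run on oneself. The sketch below assumes a list of stated (low, high) ranges and the true values that later became known; the function name `miss_rate` and the five data points are hypothetical and are not taken from the Alpert and Raiffa experiments.

```python
def miss_rate(intervals, truths):
    """Share of true values that fall outside the stated (low, high) ranges."""
    if len(intervals) != len(truths) or not truths:
        raise ValueError("need one true value per interval")
    misses = sum(1 for (low, high), t in zip(intervals, truths)
                 if not low <= t <= high)
    return misses / len(truths)


# Hypothetical answers to five quantity questions, each given as a
# range meant to contain the truth ninety-eight times in a hundred.
intervals = [(100, 200), (10, 20), (1900, 1950), (5, 8), (40, 60)]
truths = [250, 15, 1912, 9, 45]

target = 0.02  # the miss rate an honest 98 per cent interval implies
actual = miss_rate(intervals, truths)
print(f"target miss rate {target:.0%}, actual {actual:.0%}")
```

With two of the five true values outside their ranges, the actual miss rate is forty per cent against a target of two. Five questions are far too few to grade anyone, of course; the point of the sketch is only the bookkeeping.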

Figure 1. The reliability diagram. A forecaster is calibrated when the events to which she assigns a given probability occur at that rate, which is the 45-degree diagonal. Overconfidence bows the curve below the line: the things people call “ninety per cent certain” happen rather less than nine times in ten. Weather forecasters sit near the line, pundits well below.
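The same picture can be built as a table from any record of forecasts. This is a rough sketch under simple assumptions: forecasts are bucketed to the nearest ten per cent, the function name `reliability_table` is invented here, and the twenty forecasts are made-up data chosen to show an overconfident pattern.

```python
from collections import defaultdict


def reliability_table(forecasts, outcomes):
    """Return (stated probability, observed frequency, count) rows, lowest bucket first."""
    buckets = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        buckets[round(p, 1)].append(o)  # bucket to the nearest 10 per cent
    return [(p, sum(hits) / len(hits), len(hits))
            for p, hits in sorted(buckets.items())]


# Hypothetical record: ten calls at 90 per cent (seven came true)
# and ten calls at 60 per cent (five came true).
forecasts = [0.9] * 10 + [0.6] * 10
outcomes = [1] * 7 + [0] * 3 + [1] * 5 + [0] * 5

for stated, observed, n in reliability_table(forecasts, outcomes):
    print(f"said {stated:.0%}  happened {observed:.0%}  (n={n})")
```

Each row is one point on the diagram: stated probability on the horizontal axis, observed frequency on the vertical. Rows where the observed frequency falls short of the stated probability lie below the diagonal.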

The mechanism: a story feels like evidence

Why is the gap so reliable? The deepest answer in the literature is Daniel Kahneman’s: the mind judges confidence from the coherence of the story it can assemble, not from the completeness of the evidence behind it. Kahneman compressed the idea into an ungainly acronym, WYSIATI — what you see is all there is. We reason from the information in front of us as though no other information existed, and a tidy, internally consistent narrative produces a feeling of certainty that is almost entirely independent of how much we actually know. A forecaster who can tell a clean causal story about why a company must compound for a decade feels confident in proportion to the cleanness of the story, not in proportion to the base rate of decade-long compounders. The coherence is real; the calibration is imaginary.

Two further mechanisms make the investor’s case worse than the average. The first is the inside view: we forecast a particular case by building up its specific details — this management, this product, this market — and almost never by asking what happened to the broad reference class of cases that looked like it. The inside view is seductive precisely because it engages everything we know and admits nothing we don’t. The second, and more important for investing specifically, is the structure of feedback. Calibration is learned, where it is learned at all, from prompt and unambiguous error signals. The weather forecaster discovers by tomorrow afternoon whether it rained. The investor discovers whether he was right only after years, by which time the outcome has been so thoroughly contaminated by luck, by unrelated macro events, and by his own subsequent trades that the original forecast can no longer be cleanly graded. The signal that would teach calibration arrives late, noisy, and confounded — which is to say, it effectively never arrives at all. The market is the worst classroom in the world for the one lesson the investor most needs to learn.

The empirical record: two decades of being barely better than chance

The definitive study is Philip Tetlock’s, summarised in his 2005 book Expert Political Judgment: How Good Is It? How Can We Know? Over roughly twenty years Tetlock gathered some 28,000 datable, scoreable predictions from 284 professionals who made their living analysing political and economic trends, and graded them against what actually happened. The headline finding has passed into legend: the average expert was only marginally more accurate than what Tetlock memorably called a dart-throwing chimpanzee — that is, than chance allocation across the possible outcomes. Two refinements matter more than the headline. First, fame was negatively correlated with accuracy: the experts most in demand by television producers, the boldest and most quotable, were among the worst calibrated, because confident storytelling and accurate forecasting are nearly opposite skills. Second, Tetlock found a personality split he borrowed from Isaiah Berlin. “Hedgehogs,” who know one big thing and force every question through a single grand theory, were reliably worse than “foxes,” who know many small things, hold their views loosely, and stitch together fragments from incompatible schools. The fox is not smarter; the fox is better calibrated, because the fox is forever hedging.

The story does not end in despair, and this is the part that should interest investors most. Between 2011 and 2015 the United States intelligence community, through its research arm IARPA, ran a four-year forecasting tournament pitting university teams against one another on hundreds of real geopolitical questions, all scored by the Brier rule. The winning team, by a wide margin, was the Good Judgment Project, run by Tetlock and Barbara Mellers. Their central result, reported by Mellers and colleagues in Psychological Science in 2014, is that accuracy is partly a skill rather than a fixed trait, and that three things sharpen it: training, teaming and tracking. A probability-training module that took about an hour produced a measurable, durable improvement in calibration; it taught reference-class thinking, the language of degrees of belief, and the habitual avoidance of the round numbers zero and one hundred. The project also identified a thin top layer of “superforecasters” whose edge persisted year after year, which makes luck an implausible explanation. Calibration, it turns out, can be taught, practised and improved, but only under the one condition the market refuses to supply on its own: a relentless, scored feedback loop.

There is a clean positive control sitting in plain sight. Professional weather forecasters are, in study after study, among the best-calibrated human beings ever measured: when an experienced forecaster says seventy per cent chance of rain, it rains very close to seventy per cent of the time. They are not better people than pundits or portfolio managers; they simply operate inside the discipline the laboratory demands, receiving a graded result every single day and being held to it. Where the feedback loop exists, calibration follows. Where it is absent, overconfidence is the default.

Financial regulators, tellingly, have institutionalised exactly this discipline for the institutions they cannot trust to self-calibrate. In the United States, the Federal Reserve and the Office of the Comptroller of the Currency’s 2011 supervisory guidance on model risk management — the document banks know as SR 11-7 — requires “outcomes analysis”: a bank that builds a model assigning probabilities to events must routinely compare those probabilities against what actually occurred, and recalibrate when reality disagrees. In Europe and globally, the Basel Committee’s internal-ratings-based framework obliges banks that estimate their own probabilities of default to validate and back-test those estimates against realised default rates, a requirement the European Banking Authority polices in detail. Strip away the supervisory language and the instruction is identical to Raiffa’s: write down your probabilities, keep score against outcomes, and widen your estimates when the world proves you wrong. Regulators have built, by force of law, the feedback loop the individual forecaster will not build for himself.

Figure 2. Two decades of expert forecasts, and what improved them. Tetlock’s 284 experts barely outpaced chance and the most famous fared worst; the Good Judgment Project then showed that training, teaming and tracking measurably improve Brier scores and produce persistent superforecasters. Lower Brier scores are better.

Two episodes: when a precise number was a miscalibrated one

The first episode is the most expensive calibration failure in modern financial history, and it wore the costume of precision. A credit rating is, at bottom, a probability statement: a security rated triple-A is one whose chance of default is asserted to be vanishingly small. In the years before 2008 the major agencies stamped that assertion onto trillions of dollars of mortgage-backed securities. The United States Financial Crisis Inquiry Commission, in its 2011 report, recorded the reckoning: of all the mortgage securities Moody’s had rated triple-A in 2006, seventy-three per cent were subsequently downgraded to junk. By another tally cited in the same proceedings, eighty-three per cent of some 869 billion dollars of 2006-vintage triple-A mortgage paper was eventually downgraded. The ratings were not approximately right with a wide error band; they were confidently, catastrophically wrong, and their very precision — the crisp letter grade, the implied near-zero default probability — is what persuaded the world to act on them. A miscalibrated probability is far more dangerous when it is delivered with a clean label than when it is delivered with a visible shrug.

The second episode is quieter and more chronic. Economists, as a profession, almost never forecast recessions. The economist Prakash Loungani distilled the record into a sentence that has followed the field around ever since: “The record of failure to predict recessions is virtually unblemished.” A 2018 International Monetary Fund working paper that he co-authored examined dozens of recessions across dozens of countries and found that the overwhelming majority were not foreseen by the consensus even a year in advance; a great many were still not being forecast in the very year they arrived. The failure is not stupidity — these are capable people with good data — but the absence of the calibration loop, compounded by an incentive structure that punishes a lonely correct alarm more than it punishes a comfortable shared error. Against this stands one institution worth holding up as a model of the opposite habit: the United Kingdom’s Office for Budget Responsibility publishes, every year, a Forecast Evaluation Report that grades its own past forecasts against what actually happened and dwells publicly on where it went wrong — its persistent overestimation of productivity growth, its large 2021 underestimate of inflation. It is precisely the exercise no pundit volunteers for, and exactly the one the investor should imitate.

The counter-measures: three disciplines that buy calibration

Calibration cannot be willed into existence by resolving to be humble. It is bought, the way the weather forecaster and the Good Judgment Project bought it, with machinery. Three disciplines do most of the work.

First, keep score. The single act that separates the calibrated forecaster from the confident one is a written record. Before a decision, write the claim, attach an explicit probability to it (a number, not “I’m confident”) and set a date by which it will resolve. When the date arrives, resolve it honestly and compute the squared gap between the probability you stated and what happened, scoring the outcome as one if the claim came true and zero if it did not. Averaged over a few dozen entries, this becomes your personal Brier score. Sorting the same entries by stated probability will, if the research is any guide, reveal that the events you called ninety per cent likely happened nearer to seventy. This is not paperwork for its own sake; it is the manufacture, by hand, of the feedback signal the market withholds. An investor who has kept such a log for two years knows something about himself that no amount of introspection could supply.
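A forecast log needs nothing more elaborate than a list of dated entries. The sketch below is one possible layout, not a prescribed one: the `Forecast` fields, the function name `personal_brier` and the three entries (with their anonymous companies and dates) are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Forecast:
    claim: str
    probability: float              # stated chance that the claim comes true
    resolves_on: date               # the date on which it will be graded
    outcome: Optional[bool] = None  # filled in only once the claim has resolved


def personal_brier(log):
    """Average squared gap over resolved entries; None if nothing has resolved yet."""
    resolved = [f for f in log if f.outcome is not None]
    if not resolved:
        return None
    return sum((f.probability - f.outcome) ** 2 for f in resolved) / len(resolved)


# Hypothetical log: two entries graded, one still open.
log = [
    Forecast("Company A grows revenue by more than 10%", 0.9, date(2025, 3, 31), True),
    Forecast("Company B maintains its dividend", 0.9, date(2025, 6, 30), False),
    Forecast("Company C expands its operating margin", 0.6, date(2026, 6, 30)),
]

print(f"personal Brier score so far: {personal_brier(log):.2f}")
```

Open entries are left out of the score rather than guessed at, so the number only ever reflects claims that have actually been graded. On this invented log the score is 0.41, dragged up almost entirely by the one ninety-per-cent call that failed.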

Second, spread the fractiles. Raiffa’s instruction is a discipline, not a joke. When you estimate a range, whether a company’s earnings three years out, a fair multiple, or a downside in a bad scenario, deliberately widen it past the point of comfort, because the evidence is overwhelming that intervals stated with near-certainty capture the truth far less often than claimed. Alpert and Raiffa’s ninety-eight-per-cent ranges behaved like sixty-per-cent ranges. A practical instrument is the pre-mortem: imagine it is three years on and the thesis has failed, and write the obituary in detail before committing. The exercise forces the low tail into view and almost always widens the interval, which is another way of saying it improves calibration on the only side that can ruin you. A margin of safety, in the value tradition, is nothing more exotic than a fractile spread expressed in price: room built in precisely because the point estimate is known to be wrong.

Third, run your own forecast-evaluation report. Borrow the Office for Budget Responsibility’s habit and schedule, perhaps once or twice a year, a deliberate review of past calls. Separate process from outcome — a good decision can produce a bad result and vice versa, and conflating the two is how investors learn exactly the wrong lessons from luck. Sort the hits and misses by whether the reasoning was sound, not by whether the trade worked. Adopt the fox’s posture: aggregate views from sources that disagree with you, hold each loosely, and update in small steps rather than lurching between certainties. The point of the review is not self-flagellation; it is the slow, unglamorous recalibration that turns a confident forecaster into an accurate one.

Figure 3. Three calibration disciplines for the long-term investor, shown as cards: keep score with a forecast log and your own Brier score; spread the fractiles by widening ninety-per-cent intervals and running a pre-mortem; and run your own forecast-evaluation report on a schedule, separating process from luck. Sources: Brier 1950 · Alpert & Raiffa 1982 · Tetlock 2005 · Mellers et al. 2014.

How long-term-equity practitioners addressed it

The investors who have written most clearly about this rarely use the word calibration, but the discipline runs through their work. Michael Mauboussin — for decades a strategist at Credit Suisse and Legg Mason, latterly head of consilient research at Counterpoint Global — has built much of his writing around the idea that an investment is a probability-weighted distribution of outcomes, not a point forecast, and that the serious investor must therefore reason in expected value: probability times payoff, summed across the whole range of what might happen. In More Than You Know and Think Twice he insists on two habits the calibration literature would recognise instantly — consult the base rate of the reference class before trusting the vivid specifics of the case, and judge decisions by the quality of the process rather than the accident of the outcome. He cites Tetlock approvingly and for the obvious reason: both are arguing that good forecasting is a measurable, improvable skill, and that the first step is to make your probabilities explicit enough to be graded.
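Expected value in this sense is a one-line sum, and writing it out shows how it forces the probabilities into the open. The sketch below is an illustration only: the function name `expected_value` and the three scenarios, with their probabilities and returns, are hypothetical and describe no real security.

```python
def expected_value(scenarios):
    """Sum of probability times payoff over (probability, payoff) pairs."""
    total_p = sum(p for p, _ in scenarios)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")
    return sum(p * payoff for p, payoff in scenarios)


# Hypothetical three-scenario view of one investment: (probability, return).
scenarios = [
    (0.25, 1.00),   # bull case: the investment doubles
    (0.50, 0.25),   # base case: a 25 per cent gain
    (0.25, -0.50),  # bear case: half the capital is lost
]

print(f"expected return: {expected_value(scenarios):.0%}")
```

The expected return here is twenty-five per cent, but the more useful output is the list itself: each probability in it is an explicit claim that can later be logged and graded.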

Howard Marks, co-founder of Oaktree Capital, reaches the same place from the practitioner’s side. The refrain that runs through his memos, “we cannot predict, but we can prepare,” is a calibration statement in disguise: an admission that point forecasts of the macro future are not reliable, paired with a strategy of positioning for a distribution of outcomes rather than betting on a single one. Marks writes constantly about thinking in probabilities instead of certainties, about the difference between a decision and its outcome, and about the intellectual humility of the investor who knows the limits of his own foresight. It is no accident that both men are long-horizon investors. The longer the horizon, the more forecasts compound, and the more a small, persistent overconfidence widens into a large, persistent error. The margin of safety that Warren Buffett took over from his teacher Benjamin Graham belongs in the same family of devices: the deliberate insistence on a wide enough gap between price and estimated value that even a badly miscalibrated estimate still leaves the investor whole. Each of these is, in the end, the same move the meteorologist makes every morning and the pundit never makes at all: state the probability, keep the score, and widen the interval when the world disagrees.

Key takeaways

Calibration, not boldness, is the test of a forecast. A forecaster is worth listening to when the things she calls ninety-per-cent likely happen about nine times in ten, not when she is confident, famous or fluent. Confidence is a poor guide to accuracy, and among Tetlock’s most quoted experts the two were inversely related.
Overconfidence is the default, and it has been measured for decades. From Alpert and Raiffa’s collapsing ninety-eight-per-cent intervals to Tetlock’s experts barely beating chance, the evidence that subjective probabilities run ahead of reality is among the most robust in behavioural science.
The investor’s feedback loop is broken by design. Outcomes arrive late, noisy and confounded by luck, so the market almost never delivers the clean error signal that disciplines a weather forecaster. Calibration therefore has to be manufactured by hand.
It is a trainable skill. The Good Judgment Project showed that a short course in probabilistic thinking, plus teaming and relentless tracking, measurably improves accuracy. The improvement is real, durable, and available to anyone willing to keep score.
Buy calibration with three disciplines. Keep a written forecast log and grade it; spread your fractiles wider than feels comfortable and run a pre-mortem; and hold a scheduled forecast-evaluation report that separates sound process from lucky outcome. These are the machinery that turns a confident investor into an accurate one.

— Manish Goel, FCA / NorthPath Advisory OÜ / Tallinn, Estonia

Important.
All content on this site and in this email is journalism and education for a general audience. Nothing here constitutes investment advice or a recommendation in respect of any specific financial instrument, nor an offer or solicitation to buy or sell any security. Readers should consult an authorised financial adviser regulated in their own jurisdiction before making any investment decision.
