Markov Chains: Why the Future Depends Only on the Present


Afternoon Edition — Mental Models · Essay No. 15

Every forecasting argument that begins with the phrase “given how far this has already run” smuggles in a hidden premise: that the path a thing has travelled tells you something about the path it will travel next. Sometimes it does. Very often it does not. The discipline of separating the two cases is one of the oldest results in probability, and it is owed to a stubborn, combative Russian mathematician who set out to prove a point about poetry and ended up handing investors a tool for thinking about credit, cycles, and corporate decay. The tool is the Markov chain, and the property at its heart — that the future depends on the present alone, not on the road taken to reach it — is among the most useful and most abused ideas in the analyst’s kit.

The model

A Markov chain is a system that moves between a finite set of states, where the probability of the next state depends only on the current state and not on the sequence of states that preceded it. That single restriction — formally, the Markov property, or “memorylessness” — is the whole of the idea. A weather model with states {sunny, cloudy, rainy} is Markovian if tomorrow’s weather depends only on today’s, regardless of the week that led up to it. A company’s credit rating is treated as Markovian if the chance of being downgraded next year depends only on its rating now, not on whether it arrived there from above or below. The one-sentence form is worth memorising: given the present, the future is independent of the past.

The model is named for Andrei Andreyevich Markov (1856–1922), a professor at St. Petersburg and a former student of Pafnuty Chebyshev. His foundational paper, “Extension of the law of large numbers to quantities depending on each other,” appeared in 1906 in the Izvestiya of the Physico-Mathematical Society at Kazan University (2nd series, vol. 15). The motivation was not physics or finance but a polemic. A rival, Pavel Nekrasov, had argued that the law of large numbers required independence, and from this leapt to the theological claim that statistical regularities in human behaviour proved free will. Markov, an atheist and a famous contrarian — colleagues called him “Andrei the Furious” — set out to demolish the premise by proving that the law of large numbers holds perfectly well for dependent variables, provided the dependence has the memoryless structure he described. He had not merely won an argument; he had defined a new object that would outlast every party to the quarrel.

The mechanism

To make a Markov chain operational you need two things: a list of states, and a transition matrix, a table whose entry in row i, column j gives the probability of moving from state i to state j in one step. Each row sums to one, because the system must go somewhere. Everything you can ask of the chain is answered by manipulating this matrix. The two-step transition probabilities are found by multiplying the matrix by itself, and the n-step probabilities by raising it to the nth power; the distribution over states after n steps is the starting distribution multiplied by that power. This is the quiet engine of the whole subject: long-run behaviour is encoded in short-run probabilities, and matrix multiplication unrolls one into the other.
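The arithmetic is small enough to sketch in a few lines of plain Python. The three states are borrowed from Figure 1; the probabilities in P are invented for illustration and are not estimates of anything, and the helper names are hypothetical.

```python
# Hypothetical three-state chain; the probabilities are illustrative only.
STATES = ["Expansion", "Slowdown", "Contraction"]
P = [
    [0.80, 0.15, 0.05],  # from Expansion
    [0.30, 0.50, 0.20],  # from Slowdown
    [0.20, 0.30, 0.50],  # from Contraction
]

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(p, steps):
    """Return p raised to a non-negative integer power."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, p)
    return result

def distribution_after(start, p, steps):
    """Starting distribution (a row vector) times p raised to `steps`."""
    pn = mat_pow(p, steps)
    n = len(start)
    return [sum(start[i] * pn[i][j] for i in range(n)) for j in range(n)]

start = [1.0, 0.0, 0.0]  # certain to be in Expansion today
two_step = distribution_after(start, P, 2)
```

With these made-up numbers, a system that is in Expansion today is in Expansion, Slowdown and Contraction two steps ahead with probabilities of roughly 0.695, 0.21 and 0.095.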

It is worth being precise about what the memoryless assumption buys and what it costs. It buys enormous tractability. A process with full memory would require, in principle, a separate probability for every possible history — an object that grows without bound as the sequence lengthens and that no finite dataset could ever estimate. The Markov restriction collapses that explosion into a single fixed table, turning an intractable problem into ordinary arithmetic. The cost is that the assumption is an idealisation almost everywhere it is applied; few real systems are perfectly memoryless. The art, as with every model in this series, is to use the chain where the residual memory is small enough to ignore and to abandon it the moment that memory becomes the very thing you are trying to price.

A three-state Markov transition diagram with states labelled Expansion, Slowdown and Contraction, connected by arrows annotated with transition probabilities that each sum to one.
Figure 1. A three-state chain. The next state depends only on the current node; the arrows leaving any state carry probabilities that sum to one. How the system arrived at a node carries no further information.

Two consequences make the chain valuable rather than merely tidy. The first is the existence, under mild conditions, of a stationary distribution: a set of state probabilities that no longer changes when the matrix is applied again. If a finite chain is “irreducible” (every state is reachable from every other) and “aperiodic” (it does not cycle in lockstep), then no matter where it starts, its distribution over states converges to this single equilibrium. The starting point washes out. The system forgets its origin. This is the mathematical content of a phrase investors use loosely, “in the long run”, made precise.
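A minimal sketch of that convergence, assuming the same invented three-state matrix as an example: apply the matrix repeatedly until the distribution stops moving, then confirm that two opposite starting points land in the same place. The function names and tolerance are hypothetical choices; a production calculation would more likely solve the linear system directly.

```python
# Hypothetical transition matrix (rows sum to one); illustrative numbers only.
P = [
    [0.80, 0.15, 0.05],
    [0.30, 0.50, 0.20],
    [0.20, 0.30, 0.50],
]

def step(dist, p):
    """Advance a distribution over states by one step of the chain."""
    n = len(p)
    return [sum(dist[i] * p[i][j] for i in range(n)) for j in range(n)]

def stationary(p, start, tol=1e-12, max_steps=10_000):
    """Iterate until successive distributions differ by less than tol."""
    dist = list(start)
    for _ in range(max_steps):
        nxt = step(dist, p)
        if max(abs(a - b) for a, b in zip(nxt, dist)) < tol:
            return nxt
        dist = nxt
    raise RuntimeError("did not converge; the chain may be periodic or reducible")

from_first_state = stationary(P, [1.0, 0.0, 0.0])
from_last_state = stationary(P, [0.0, 0.0, 1.0])
```

Both runs settle on the same vector, which for this particular matrix works out to 38/67, 18/67 and 11/67. The iteration is guaranteed to converge only under the irreducibility and aperiodicity conditions described above.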

The second consequence is the absorbing state: a state the chain can enter but never leave, whose row in the matrix is a one on the diagonal and zeros elsewhere. Bankruptcy is the canonical example. A company can migrate up and down the rating scale for decades, but default is a trapdoor. Chains with absorbing states do not settle into a lively equilibrium; they drain, sooner or later, into the trap. The mathematics that tells you how long the draining takes — expected time to absorption — is precisely the mathematics a long-term owner of a fragile business ought to be running, if only in the back of the mind.
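Expected time to absorption can be sketched the same way. The vector t of expected steps satisfies t = 1 + Qt, where Q is the block of the matrix covering the non-absorbing states; the snippet below simply iterates that equation until it settles, which is equivalent to the textbook fundamental-matrix formula. The three states and every probability here are hypothetical.

```python
# Hypothetical chain with one absorbing state; illustrative numbers only.
STATES = ["Healthy", "Distressed", "Default"]
P = [
    [0.90, 0.09, 0.01],  # from Healthy
    [0.30, 0.50, 0.20],  # from Distressed
    [0.00, 0.00, 1.00],  # Default is absorbing: it can be entered, never left
]

def expected_steps_to_absorption(p, absorbing, tol=1e-12, max_iter=100_000):
    """Solve t = 1 + Q t by fixed-point iteration over the transient states."""
    transient = [i for i in range(len(p)) if i not in absorbing]
    t = {i: 0.0 for i in transient}
    for _ in range(max_iter):
        nxt = {i: 1.0 + sum(p[i][j] * t[j] for j in transient) for i in transient}
        if max(abs(nxt[i] - t[i]) for i in transient) < tol:
            return nxt
        t = nxt
    raise RuntimeError("no convergence; is absorption reachable from every state?")

years = expected_steps_to_absorption(P, absorbing={2})
```

For these invented numbers the answer is about 25.7 steps from Healthy and about 17.4 from Distressed. The exercise is only as good as the assumption that the matrix stays fixed over that whole span.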

A convergence line chart showing three coloured trajectories that begin at very different starting probabilities and all settle onto the same horizontal stationary level after a number of steps.
Figure 2. Convergence to the stationary distribution. Three chains begin from very different states; under irreducibility and aperiodicity they forget where they started and settle onto a common equilibrium. “The long run” made precise.

The empirical record

The most consequential commercial application of Markov chains in finance sits inside the credit-rating agencies. Each year S&P Global Ratings and Moody’s publish a transition matrix built from decades of issuer histories: the empirical frequency with which a company rated, say, BBB at the start of a year is found a year later still at BBB, or upgraded to A, or downgraded to BB, or in default. These tables are explicitly Markovian in construction — the probability of next year’s rating is read off this year’s rating — and they share three features worth dwelling on. They are strongly diagonal-dominant: the single most likely outcome, by a wide margin, is that a rating stays put. The probabilities fall away as you move from the diagonal, so large jumps are rarer than small ones. And default behaves as an absorbing state, capturing issuers permanently.

A stylised one-year credit-rating transition matrix shown as a grid, with the strongest shading running down the diagonal where ratings stay unchanged, fading away from the diagonal, and a dark absorbing default column on the right.
Figure 3. A stylised one-year rating-transition matrix. Mass concentrates on the diagonal (ratings persist), thins with distance, and drains into an absorbing default column. Multiplying the matrix by itself gives the multi-year picture.

The structure pays a practical dividend: a multi-year migration is, to a first approximation, the one-year matrix raised to a power. The chance that a single-A issuer defaults within five years is computed by raising the one-year matrix to the fifth power and reading the default column in the single-A row. This is why credit desks can speak of cumulative default rates across the rating spectrum at all. It is also where the honest caveat lives. Academic testing (for instance Jafry and Schuermann’s “Measurement, estimation and comparison of credit migration matrices” in the Journal of Banking & Finance, vol. 28, 2004, and a parallel literature at the U.S. Office of the Comptroller of the Currency) has shown that real rating histories violate the strict Markov assumption in two ways. There is momentum: a company recently downgraded is more likely to be downgraded again than its current rating alone would predict, because the path carries information the rating does not. And the matrix is not constant through time: downgrade and default probabilities rise in recessions. The model is a serviceable approximation over a few years and a misleading one over many. That is the correct relationship to have with any Markov chain: use it, and know exactly where it lies to you.
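As a rough illustration of the multi-year calculation, the sketch below raises a stylised four-rating matrix to successive powers and reads off the default entry for the top rating. The matrix is invented for the purpose and is not taken from any agency's published tables; the function name is hypothetical. The calculation also inherits both assumptions questioned above: no momentum, and a matrix that does not change from year to year.

```python
# Stylised one-year migration matrix; every number is hypothetical.
RATINGS = ["A", "BBB", "BB", "D"]
P = [
    [0.920, 0.070, 0.009, 0.001],  # from A
    [0.050, 0.880, 0.060, 0.010],  # from BBB
    [0.005, 0.075, 0.850, 0.070],  # from BB
    [0.000, 0.000, 0.000, 1.000],  # D (default) is absorbing
]

def cumulative_default(p, rating_index, years):
    """Probability of being in default after `years` steps, via p ** years."""
    n = len(p)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(years):
        power = [[sum(power[i][k] * p[k][j] for k in range(n)) for j in range(n)]
                 for i in range(n)]
    return power[rating_index][n - 1]  # default is the last column

# Cumulative default probability for an A-rated issuer, horizons 1 to 5 years.
curve = [cumulative_default(P, 0, y) for y in range(1, 6)]
```

Because default is absorbing, the curve can only rise with the horizon. It rises faster than five times the one-year figure, because each year some issuers migrate to lower ratings from which default is more likely.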

Two historical episodes

The first episode is the chain’s own birth certificate. Having proved his theorem in the abstract in 1906, Markov wanted a real sequence of dependent events to demonstrate it on, and he chose, with characteristic perversity, a poem. In a paper of 1913 read to the Imperial Academy of Sciences in St. Petersburg, he took the first 20,000 letters of Pushkin’s Eugene Onegin, stripped out spaces and punctuation, and classified each letter as a vowel or a consonant. Counting by hand, he found that the letter after a vowel was itself a vowel only about 13 percent of the time, whereas the letter after a consonant was a vowel about 66 percent of the time. The letters were plainly not independent (the language has structure), yet the long-run frequencies obeyed the law of large numbers exactly as his theorem required. It was the first empirical Markov chain ever estimated, and it established the template every later application would follow: define the states, count the transitions, and let the matrix speak.
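The template of defining states, counting transitions and normalising is short enough to sketch. The snippet below is a loose modern analogue of the procedure, not a reconstruction of Markov's own tabulation: it uses English vowels and a throwaway English sentence as stand-ins, and the function names are hypothetical.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: English vowels; Markov worked on Russian text

def classify(text):
    """Drop non-letters and label each letter 'V' (vowel) or 'C' (consonant)."""
    return ["V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha()]

def transition_frequencies(labels):
    """Count adjacent pairs, then normalise each row so it sums to one."""
    pairs = Counter(zip(labels, labels[1:]))
    matrix = {}
    for prev in ("V", "C"):
        row_total = sum(pairs[(prev, nxt)] for nxt in ("V", "C"))
        if row_total:  # skip a row if that state never occurs as a predecessor
            matrix[prev] = {nxt: pairs[(prev, nxt)] / row_total for nxt in ("V", "C")}
    return matrix

sample = "the quick brown fox jumps over the lazy dog"  # placeholder text
freqs = transition_frequencies(classify(sample))
```

Each row of freqs is an estimated row of the two-state transition matrix. With a sample this short the estimates are noise; the point of Markov's 20,000 letters was that the frequencies stabilise only over a long sequence.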

The second episode is the most valuable Markov chain ever built. In 1998 two Stanford graduate students, Sergey Brin and Lawrence Page, published “The Anatomy of a Large-Scale Hypertextual Web Search Engine.” Their ranking algorithm, PageRank, modelled a hypothetical web surfer who, at each page, either follows a random outbound link (with probability about 0.85, the “damping factor”) or, occasionally, grows bored and jumps to a random page anywhere (with probability about 0.15). That “random surfer” is a Markov chain over the entire web: the states are pages, the transition probabilities come from the links, and the small jump probability is exactly the device that makes the chain irreducible and aperiodic, guaranteeing a unique stationary distribution. The importance of a page is defined as the long-run fraction of time the surfer spends on it: its weight in the stationary distribution. A company eventually worth more than a trillion dollars was, at its technical core, a single enormous application of Markov’s 1906 theorem. The lesson for an analyst is not about search engines. It is that the stationary distribution, where a system spends its time once the starting conditions have washed out, is frequently the thing of real economic value, and it is computable from nothing more than the one-step transition probabilities.
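The random surfer can be sketched on a toy web of four pages. This is a minimal illustration of the idea rather than the production algorithm. The link graph is invented, the ranks are normalised to sum to one, and a page with no outbound links is treated as jumping uniformly at random; the last two are choices made for this sketch.

```python
DAMPING = 0.85  # probability of following a link; 1 - DAMPING is the random jump

# Hypothetical four-page web: each page maps to the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, damping=DAMPING, tol=1e-12, max_iter=1000):
    """Stationary distribution of the random-surfer chain, by repeated stepping."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Every page receives an equal share of the random-jump probability.
        nxt = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            targets = outlinks if outlinks else pages  # dangling page: jump anywhere
            share = damping * rank[page] / len(targets)
            for target in targets:
                nxt[target] += share
        if max(abs(nxt[p] - rank[p]) for p in pages) < tol:
            return nxt
        rank = nxt
    raise RuntimeError("PageRank did not converge")

ranks = pagerank(LINKS)
```

In this toy graph page C, which collects links from three of the four pages, ends up with the largest weight. Page D, which nothing links to, is left with only its share of the random jump, 0.15 / 4.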

Application to long-term equity investing

The Markov chain is not a stock-selection formula, and treating it as one is a category error. Its value to a long-term owner of businesses is as a discipline for thinking about persistence, reversion, and ruin. Three operating disciplines follow.

First: ask whether the variable you care about is memoryless, and act differently depending on the answer. Some business quantities are strongly path-dependent — a brand built over a century, an installed base, a regulatory licence — and for these the past genuinely does constrain the future; recent history is informative. Others behave far more like memoryless chains snapping back toward an equilibrium: commodity margins, the return on capital of an undifferentiated manufacturer, the growth rate of a mature firm. The single most common analytical error is to extrapolate the recent path of a memoryless variable, paying up for a string of good years that the underlying chain will, with high probability, revert. The chain forces the question: is this a quantity whose path informs its future, or one whose present alone does?

Second: think in transition probabilities, not point forecasts. A Markov view replaces “earnings will grow 12 percent” with a distribution: given the firm is in its current state, here are the probabilities of the states it could occupy in three years, and here is what each would be worth. This is more honest and more useful, because it surfaces the tails. It also reframes the analyst’s job as estimating a transition matrix — how sticky is this competitive position, how quickly do firms like this one decay — rather than producing a single confident number that the next downgrade will embarrass. The practical test is whether you can state, for any business you own, the rough probability that it is in a materially worse competitive state three years from now, and what that state would do to its economics. If the question feels unanswerable, the position has been sized on a forecast rather than on a distribution, and the difference tends to announce itself at the worst possible moment.

Third: respect the absorbing state. The one transition from which there is no return is permanent loss of capital — dilutive recapitalisation, covenant breach, insolvency. Because absorption is irreversible, even a small annual probability of it compounds into near-certainty over a long horizon, and it cannot be offset by good outcomes elsewhere in the chain, which is the formal version of the maxim that the first job is to survive. A portfolio of businesses each carrying a modest annual chance of the trapdoor is a chain quietly draining toward it. Position sizing, balance-sheet conservatism, and a margin of safety are, in this language, simply ways of keeping the absorbing-state probability low enough that time remains your ally rather than your executioner.
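The compounding is worth seeing in numbers. The sketch below assumes, purely for illustration, a constant annual probability of permanent loss that is independent from year to year; real hazards are neither constant nor independent, and the function names are hypothetical.

```python
def survival_probability(annual_loss_prob, years):
    """Chance of avoiding the absorbing state in every one of `years` years."""
    return (1.0 - annual_loss_prob) ** years

def cumulative_ruin(annual_loss_prob, years):
    """Chance of having hit the absorbing state at least once by `years`."""
    return 1.0 - survival_probability(annual_loss_prob, years)

# Illustrative input: a 2 percent annual chance of permanent loss.
ruin_30y = cumulative_ruin(0.02, 30)
ruin_100y = cumulative_ruin(0.02, 100)
```

Under those assumptions a 2 percent annual chance becomes roughly a 45 percent chance over thirty years and roughly 87 percent over a century. How quickly the figure approaches certainty depends on both the annual probability and the horizon, which is why small reductions in the annual figure are worth so much to a long-term holder.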

How the long-term equity tradition has used it

The most rigorous modern exponent of Markov-style reasoning in equity analysis is Michael Mauboussin. In “The Base Rate Book” (Credit Suisse, September 2016) and the earlier “Death, Taxes, and Reversion to the Mean” (2007), he assembled multi-decade transition data on corporate performance — sales growth, operating margin, return on invested capital — and showed how rapidly each reverts toward the mean. His central finding is a transition-matrix finding in all but name: high return-on-capital is far less persistent than narratives suggest, sales growth has almost no year-to-year memory, and the correct prior for any firm is the base rate of its reference class rather than the extrapolation of its own recent trajectory. Mauboussin’s insistence that an analyst begin with the population’s transition frequencies and adjust only cautiously for the specific company is the discipline of the chain applied to fundamentals: where a business is now constrains, but does not determine, where it goes, and the constraint is best measured across many comparable histories.

The second practitioner is Howard Marks, whose framework in “Mastering the Market Cycle” (2018) is Markovian in spirit even where the mathematics stays implicit. Marks refuses point forecasts and insists instead on assessing “where we stand” in the cycle, that is, the current state, and reasoning from it to a distribution of possible next states. His repeated formulation, that “we may never know where we’re going, but we’d better have a good idea where we are,” is the Markov property turned into temperament: the present position, read honestly, is the most informative input to the conditional distribution of what comes next, and the past path matters chiefly insofar as it has set the present state. As a builder of one of the largest distressed-credit firms, Marks also spent a career pricing exactly the rating-migration and default chains described above, which is perhaps why the cyclical, state-conditional habit of mind comes through so plainly in his memos. Between Mauboussin’s base rates and Marks’s cycle positioning, the long-term tradition has, without always naming it, made the Markov chain a working part of judgement.

Key takeaways

  • The Markov property is the whole idea: given the present state, the future is independent of the past. Before extrapolating a trend, ask whether the variable actually carries memory or merely looks as though it does.
  • The transition matrix is the object to estimate. Long-run behaviour — persistence, reversion, equilibrium — is encoded in one-step probabilities and unrolled by matrix multiplication. Think in distributions of next states, not single forecasts.
  • Stationary distributions reveal where value lives. Once starting conditions wash out, a system spends its time according to a fixed distribution computable from the transition probabilities — the insight that made PageRank, and that disciplines “in the long run” talk.
  • The absorbing state dominates a long horizon. A small annual probability of permanent loss compounds toward certainty and cannot be averaged away; survival precedes optimisation.
  • Treat the model loosely. Real chains show momentum and shift in recessions; the strict Markov assumption is a useful few-year approximation and a misleading long-run one. Use it, and know where it lies to you.

Markov proved his theorem to win an argument about free will, and demonstrated it on a poem. More than a century later the same structure prices corporate credit, ranks the web, and — for the patient owner of businesses — quietly insists on the two questions that matter most: where, honestly, are we now, and which transition can we not afford to take.

— Manish Goel, FCA / NorthPath Advisory OÜ / Tallinn, Estonia
