Empirical asset pricing is going through a quiet revolution. For decades, econometrics gave us simple, interpretable models of risk and return. Today, machine learning is rewriting the playbook—changing how researchers and practitioners forecast returns, discover factors, and measure risk.
This isn’t just about swapping regression for a neural network. It’s about a deeper shift in how we extract information from markets: from simple factor models with a handful of variables, to adaptive systems that can process hundreds of predictors—ranging from book-to-market ratios to Reddit posts to satellite images of Walmart parking lots.
The Evolution: From Linear Models to High-Dimensional Problems
The story starts in the 1960s with the Capital Asset Pricing Model (CAPM), which claimed that the market factor alone explained expected returns. CAPM didn’t survive empirical scrutiny. By the 1990s, Fama and French expanded the model to include size and value factors, creating the now-famous three-factor model.
This gave researchers and practitioners a benchmark, but it also set off decades of “factor wars.” Momentum, profitability, liquidity, and hundreds of other anomalies entered the literature. By 2016, Harvey, Liu, and Zhu had documented more than 300 published factors.
The two challenges became clear:
High dimensionality of covariates — hundreds of potential predictors, from traditional accounting ratios to alternative data like order books and ESG scores.
Complex, nonlinear relationships — factors don’t act independently. They interact, amplify, and sometimes reverse depending on market regimes.
Traditional linear regressions can’t keep up with this complexity.
Why Machine Learning?
Machine learning excels where econometrics falls short.
Taming high dimensionality: With hundreds of covariates, ordinary least squares becomes unstable and overfits. Regularization methods like Lasso, Ridge, and Elastic Net shrink or discard noisy predictors while retaining the strongest signals.
Modeling nonlinearities: Markets are not linear. Tree-based models and neural networks uncover interactions—for example, momentum effects that depend on liquidity or size—relationships traditional models ignore.
Focusing on prediction: Econometrics prizes in-sample statistical significance. Machine learning targets out-of-sample Sharpe ratios and portfolio performance, aligning directly with investor goals (the sketch after this list makes the contrast concrete).
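Here is a minimal sketch, using synthetic data and scikit-learn, of a hypothetical return process in which momentum only pays off when liquidity is poor. A linear regression averages over the interaction; a gradient-boosted tree picks it up, and the gap shows exactly where it should: out of sample.

```python
# Minimal sketch: a nonlinear interaction (momentum x liquidity) that OLS
# misses but a tree ensemble recovers. Data are synthetic, purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 5000
momentum = rng.normal(size=n)
illiquidity = rng.uniform(size=n)

# Hypothetical data-generating process: momentum only pays off
# when liquidity is poor (high illiquidity), i.e. an interaction effect.
ret = 0.5 * momentum * (illiquidity > 0.7) + rng.normal(scale=0.5, size=n)

X = np.column_stack([momentum, illiquidity])
X_tr, X_te, y_tr, y_te = train_test_split(X, ret, test_size=0.3, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)
gbt = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

print(f"OLS  out-of-sample R^2: {r2_score(y_te, ols.predict(X_te)):.3f}")
print(f"Tree out-of-sample R^2: {r2_score(y_te, gbt.predict(X_te)):.3f}")
```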
A landmark study by Gu, Kelly, and Xiu (2020) showed that tree-based models and neural networks consistently outperformed linear regressions across a large set of firm characteristics. The results were economically sensible too: price-trend (momentum) and liquidity variables emerged among the most powerful predictors, confirming economic intuition while demonstrating the edge of modern methods.
The Unified Framework: Stochastic Discount Factor Meets ML
At the heart of modern asset pricing lies the stochastic discount factor (SDF)—sometimes called the pricing kernel. In plain terms, the SDF is the tool that ensures asset prices are consistent with no-arbitrage. It tells us how to value risky payoffs: an asset’s price today equals the expected value of its future payoff, discounted by the SDF.
Mathematically, for any asset return $R_{t+1}$, the no-arbitrage condition is

$$
\mathbb{E}_t\!\left[M_{t+1}\, R_{t+1}\right] = 1,
$$

where $M_{t+1}$ is the stochastic discount factor.
If an asset’s payoff covaries negatively with the SDF (it pays off in good times, when the SDF is low, and disappoints in bad times, when the SDF is high), it must offer a higher expected return as compensation.
If its payoff covaries positively with the SDF, delivering precisely in bad times, it works as a hedge and earns a lower expected return.
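One line of algebra makes the intuition precise. Substituting the definition of covariance into the pricing equation, and writing $R_f = 1/\mathbb{E}_t[M_{t+1}]$ for the risk-free rate, gives

$$
\mathbb{E}_t[R_{t+1}] - R_f \;=\; -\,R_f\,\mathrm{Cov}_t\!\left(M_{t+1},\, R_{t+1}\right).
$$

A return that covaries negatively with the SDF therefore carries a positive risk premium, and vice versa.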
In this way, the SDF ties economic risk pricing theory directly to empirical return data.
Why the SDF is a Natural Bridge to Machine Learning
Traditional factor models (like CAPM or Fama-French) can be seen as special cases of an SDF. Each factor (market, size, value, momentum) is just a way of approximating the true SDF with a linear combination of a few variables. The limitation: linear models assume fixed relationships and only a handful of factors.
Machine learning changes the game. Instead of restricting ourselves to a few pre-specified factors, we can let ML learn a flexible mapping function 𝑓(𝑥) from firm characteristics and macro variables into the SDF.
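One concrete parameterization, in the spirit of recent deep-learning SDF work (for example, Chen, Pelger, and Zhu), treats the SDF as a portfolio whose weights are a flexible function of observables:

$$
M_{t+1} \;=\; 1 - \sum_{i} \omega\!\left(z_{i,t}\right) R^{e}_{i,t+1},
$$

where $z_{i,t}$ are firm $i$’s characteristics, $R^{e}_{i,t+1}$ is its excess return, and the weight function $\omega(\cdot)$ (the notation here is ours) is linear in classical models but can be a neural network or tree ensemble, trained so that the no-arbitrage conditions $\mathbb{E}[M_{t+1} R^{e}_{i,t+1}] = 0$ hold as closely as possible.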
Economists appreciate this because the SDF keeps the model grounded in theory.
Data scientists appreciate it because ML methods (neural nets, random forests, IPCA) can capture nonlinearities, interactions, and time variation that regressions miss.
Kelly, Pruitt, and Su (2019) developed Instrumented PCA (IPCA), a method for estimating time-varying SDFs using firm characteristics. Rather than picking factors by hand, IPCA learns the most informative combinations directly from the data. The payoff was substantial: their model delivered out-of-sample Sharpe ratios far beyond those of classical factor models.
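For reference, the restricted IPCA specification is compact: returns load on a few latent factors, and the loadings are instrumented by observable characteristics,

$$
r_{i,t+1} \;=\; z_{i,t}^{\top}\,\Gamma\, f_{t+1} + \epsilon_{i,t+1},
$$

where $z_{i,t}$ is firm $i$’s characteristic vector, $\Gamma$ maps characteristics into factor loadings, and $f_{t+1}$ are the latent factors. Because loadings move with characteristics, betas vary through time without ever being estimated stock by stock.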
The lesson is clear: by embedding ML within the SDF framework, we combine economic interpretability with predictive power, turning theory into a practical tool for portfolio construction.
What Works in Practice
The promise of machine learning in asset pricing isn’t just theoretical. It comes alive when you look at how investors actually apply these tools to build better portfolios. Four areas stand out as both impactful and realistic in practice.
1. Alternative and Alternative-Like Data
Financial markets today are rich in signals beyond balance sheets and price ratios. Consider order book data: short-term imbalances between buy and sell orders often foreshadow price moves. Traditional regressions might pick up average effects, but tree-based models can spot the nonlinear liquidity “holes” where these imbalances matter most—moments when a small imbalance leads to outsized moves.
Another area is economic links between firms. A supplier’s performance often spills over to its customers, and shared analyst coverage or geographic clustering can create correlated risks across companies. Machine learning models are well-suited to capture these subtle, network-driven dependencies. These are not just academic curiosities—hedge funds have turned supplier-customer momentum into profitable trading strategies in both U.S. and Chinese markets.
The takeaway is not that every dataset (tweets, parking lots, satellite photos) will produce alpha, but that the right kind of alternative data, combined with nonlinear models, can unlock economic relationships that traditional methods overlook.
2. Regularization
Adding predictors is only valuable if you can control for noise. Regularization techniques—Ridge, Lasso, Elastic Net—are essential. Ridge spreads weights thinly across many signals, while Lasso zeroes out the weakest predictors. The result is not just statistical neatness but portfolios that are more stable and less prone to overfitting. In practice, regularization means you can explore broader signal sets without blowing up your risk budget.
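A minimal sketch, again with synthetic data and scikit-learn, shows the two behaviors side by side: a hundred candidate signals, only five of them real, with cross-validated penalties chosen automatically.

```python
# Minimal sketch (synthetic data): 100 candidate signals, only 5 of which
# are real. Lasso zeroes out most of the noise; Ridge keeps everything
# but shrinks it. Cross-validated variants pick the penalty automatically.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(1)
n_obs, n_signals = 2000, 100
X = rng.normal(size=(n_obs, n_signals))
true_beta = np.zeros(n_signals)
true_beta[:5] = 0.5                      # only the first 5 signals matter
y = X @ true_beta + rng.normal(scale=2.0, size=n_obs)

lasso = LassoCV(cv=5).fit(X, y)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)

print(f"Lasso: non-zero coefficients: {np.sum(lasso.coef_ != 0)}")
print(f"Ridge: largest |coef| among noise vars: {np.abs(ridge.coef_[5:]).max():.3f}")
```

Lasso typically keeps only a handful of the hundred coefficients here, while Ridge retains all of them but shrinks the noise toward zero.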
3. Ensemble Methods
No single model dominates across market regimes. That’s why ensemble learning—blending forecasts from different models—is so powerful. A portfolio manager who averages predictions from elastic net, random forest, and gradient boosting typically achieves sharper, more consistent returns than any one model alone. In investing, this translates into portfolios with smoother drawdowns and higher long-term Sharpe ratios.
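As a sketch of the idea, the snippet below (synthetic data, scikit-learn) averages the forecasts of an elastic net, a random forest, and a gradient-boosted tree; on real return panels the blend weights would typically be tuned on a rolling validation window.

```python
# Minimal sketch: blend three model families by averaging their forecasts.
# Synthetic data stand in for a panel of firm characteristics and returns.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNetCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=2000, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = [ElasticNetCV(cv=5),
          RandomForestRegressor(n_estimators=300, random_state=0),
          GradientBoostingRegressor(random_state=0)]

# Equal-weight average of the three forecasts. In practice the blend
# weights could themselves be tuned on a validation window.
blend = np.mean([m.fit(X_tr, y_tr).predict(X_te) for m in models], axis=0)

for m in models:
    print(f"{type(m).__name__}: {r2_score(y_te, m.predict(X_te)):.3f}")
print(f"Ensemble: {r2_score(y_te, blend):.3f}")
```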
4. Interpretability Tools
Machine learning need not be a black box. Tools like permutation importance and SHAP values allow quants to see which variables drive predictions and how their effects vary across assets. This doesn’t just satisfy curiosity—it reconnects machine learning with economic intuition. If your model says liquidity and momentum are dominant drivers, and the interpretation tools confirm it, you gain confidence that your forecasts are anchored in known market dynamics rather than spurious noise.
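The scikit-learn version of the first tool fits in a few lines: shuffle one feature at a time and measure how much out-of-sample fit degrades. (The shap package provides the complementary per-prediction view.) Data and feature names below are synthetic placeholders.

```python
# Minimal sketch: permutation importance on a fitted model. Shuffling a
# feature and measuring the drop in out-of-sample fit reveals how much
# the model actually leans on it.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)

# Hypothetical feature names, purely for readability of the output.
names = [f"signal_{i}" for i in range(X.shape[1])]
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"{names[i]}: {result.importances_mean[i]:.3f}")
```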
From Classical to Modern Portfolio Construction
The contrast between classical econometrics and modern machine learning becomes clearest in how we construct factor portfolios.
In the classical world, the Fama–French model defined value as a simple long–short strategy: buy high book-to-market stocks, short low book-to-market stocks, and rebalance periodically. The portfolio was built on a single characteristic, estimated with straightforward regressions. This approach had elegance and interpretability, but it left enormous amounts of data on the table.
The modern machine learning approach starts with the same intuition—value matters—but refuses to stop there. A Lasso regression, for instance, can simultaneously consider hundreds of predictors: book-to-market, momentum, liquidity measures, volatility, analyst sentiment, even textual tone from earnings calls. The regularization step automatically shrinks or discards weak signals, leaving behind a weighted combination of predictors that adapts as conditions change.
The result is not a “value factor” in the old sense, but a composite signal: a portfolio that responds to many dimensions of firm characteristics at once. Out-of-sample, these ML-driven portfolios consistently show stronger returns, higher Sharpe ratios, and better resilience across market cycles than their classical counterparts. Where the Fama–French framework gave us clarity, the ML framework gives us adaptability and robustness.
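A minimal end-to-end sketch of this idea, on synthetic data with hypothetical signal counts, fits a cross-validated Lasso on a large characteristic panel and converts the forecasts into an equal-weighted decile long-short portfolio. A production version would re-estimate on a rolling window and add transaction-cost and risk controls.

```python
# Minimal sketch: turn regularized return forecasts into a long-short
# portfolio. One synthetic cross-section stands in for a rebalance date.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n_stocks, n_chars = 1000, 200
X_hist = rng.normal(size=(n_stocks, n_chars))      # past characteristics
beta = np.zeros(n_chars)
beta[:8] = 0.5                                     # only a few real signals
r_next = X_hist @ beta + rng.normal(scale=2.0, size=n_stocks)

model = LassoCV(cv=5).fit(X_hist, r_next)          # learn the composite signal

X_today = rng.normal(size=(n_stocks, n_chars))     # today's cross-section
forecast = model.predict(X_today)

# Long the top decile of forecasts, short the bottom decile, equal weights.
lo, hi = np.quantile(forecast, [0.1, 0.9])
weights = np.where(forecast >= hi, 1.0, np.where(forecast <= lo, -1.0, 0.0))
weights /= np.abs(weights).sum()
print(f"Net exposure: {weights.sum():+.4f} | gross: {np.abs(weights).sum():.2f}")
```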
Practical Lessons for Quants
The shift from econometrics to machine learning in asset pricing isn’t about abandoning the past. It’s about upgrading the toolkit while holding onto what still works.
Don’t fight high dimensionality—leverage it. Use regularization and ensemble methods to manage large predictor sets instead of forcing sparse models.
Use alternative data strategically. Some signals (like supplier networks) have proven staying power, while others fade quickly. Treat them as short-term edges, not permanent factors.
Stay grounded in theory. The stochastic discount factor framework ensures that machine learning predictions remain interpretable and economically consistent.
Redefine success. What matters isn’t an in-sample R², but out-of-sample Sharpe ratios and portfolio performance.
Keep it explainable. Tools like SHAP values and permutation importance help bridge the gap between black-box models and economic intuition, making results both practical and defensible.
Classical models gave us a language for risk premia and a framework for understanding markets. Machine learning doesn’t replace that foundation; it builds on it, offering the flexibility to capture complexity in ways linear models never could.
The reality is clear: empirical finance has already crossed the threshold. For today’s quants and researchers, machine learning is no longer optional—it is the new baseline. The firms that embrace this shift will not only forecast better but also understand markets more deeply, turning complexity into opportunity.