In fields including computer science and data science, it is common practice when building predictive models, whether to forecast customer churn or to recognize images, to focus on the variables with the highest predictive power. This often involves identifying a few “strong” signals—such as user engagement metrics for churn prediction or edge-detection features in image recognition—while discarding “weak” variables that contribute less to overall model accuracy.

But making accurate predictions in economics and finance is notoriously challenging, largely because the easily exploitable alpha—or opportunities for abnormal returns—from the strongest and most obvious signals has already been found and capitalized on. As a result, economic and financial datasets are left with weaker, subtler signals that offer smaller potential gains in alpha and are not immediately apparent or easy to capture.

Chicago Booth PhD student Zhouyu Shen and Booth’s Dacheng Xiu suggest that these weak signals provide an important opportunity and that learning how best to make use of them has become critical for anyone looking to improve predictive accuracy. A commonly used prediction method can struggle with them, their research finds, while an older, less-used model performed better in their tests.

Weak signals are prevalent in economic data. For example, changes in personal income, the unemployment rate, or corporate bond spreads may not seem relevant, on their own, to someone trying to predict a move in industrial production. But such data could be helpful in combination, the researchers explain. After all, personal income changes are tied to consumer demand. Corporate bond spreads signal shifts in business borrowing costs. The unemployment rate provides a read on labor dynamics. Together, these variables could start to paint a more comprehensive picture of the factors influencing industrial production.

A prediction model that works for strong signals will not necessarily work for a data set full of subtle ones, however. In this case, which machine learning models can best capture faint patterns in high-dimensional data sets (those with a lot of variables)?

The common approach of focusing on strong signals and eliminating most weak signals to build predictive models has an advantage: It helps avoid overfitting, which occurs when a model becomes too tailored to its training data and loses the crucial ability to generalize to new, unseen data. However, when signals are weak, this selective process can lead to errors, undermining the benefits of a parsimonious (essentially simple) model by potentially excluding subtle yet valuable information or relying on incorrectly chosen signals.

To discover which ML methods remain effective at making use of subtle signals, the researchers employed an approach that combined theoretical work, simulations, and empirical analysis.

Regression is a popular technique for economic and financial forecasting, and an especially common variant is the least absolute shrinkage and selection operator, or LASSO, which automatically weeds out weaker variables. Shen and Xiu compared LASSO with Ridge regression, an older method that has fallen somewhat out of fashion. They then extended their analysis to include tree-based ML models (random forest and gradient-boosted regression trees) and neural networks.
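For readers who want the mechanics, the two estimators differ only in their penalty term; the display below is the standard textbook formulation rather than notation taken from the paper. LASSO’s absolute-value (L1) penalty can set coefficients exactly to zero, while Ridge’s squared (L2) penalty shrinks every coefficient toward zero without eliminating any of them.

```latex
\hat{\beta}^{\mathrm{LASSO}}
  = \arg\min_{\beta}\ \sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^{2}
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert,
\qquad
\hat{\beta}^{\mathrm{Ridge}}
  = \arg\min_{\beta}\ \sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^{2}
    + \lambda \sum_{j=1}^{p} \beta_j^{2}
```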

LASSO works well when there is a mix of strong and weak signals, but it struggles with data sets that consist mostly of faint signals, as is often the case in economics and finance. In fact, the researchers find that its performance can be worse than ignoring the signals altogether. Ridge regression, on the other hand, tends to do a better job of leveraging the cumulative power of less prominent signals, according to the research.

To validate their theoretical findings, the researchers performed simulations and empirical analysis that applied the methods to six real-world data sets from finance, macroeconomics, and microeconomics. These data sets included ones used to predict equity market returns, forecast industrial production growth, and analyze crime rates and economic outcomes.

Ridge regression consistently provided more accurate predictions than LASSO in data sets dominated by weak signals, which suggests it is a more reliable tool for economic and financial prediction in these scenarios, the researchers write. Ridge keeps all variables in the model but ensures that less relevant ones don’t dominate the prediction, whereas LASSO eliminates the less impactful variables altogether, causing it to miss subtle yet collectively significant signals.
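A minimal simulation sketch, purely illustrative and not the researchers’ design, makes the contrast concrete: every one of 200 coefficients below is tiny, so no predictor is worth keeping on its own, yet together they carry real signal. On data like this, Ridge typically posts a higher out-of-sample R² and LASSO zeroes out most of the coefficients, though the exact numbers depend on the random draw.

```python
# Illustrative simulation, not the researchers' design: 200 weak signals,
# none strong enough to matter on its own, compared out of sample.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 500, 200
beta = rng.normal(0.0, 0.05, size=p)      # every coefficient is tiny: all signals are "weak"
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)         # noise swamps any individual predictor

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

lasso = LassoCV(cv=5).fit(X_tr, y_tr)                            # L1 penalty: drops most variables
ridge = RidgeCV(alphas=np.logspace(-2, 4, 60)).fit(X_tr, y_tr)   # L2 penalty: shrinks all of them

print("LASSO out-of-sample R^2:", round(r2_score(y_te, lasso.predict(X_te)), 3))
print("Ridge out-of-sample R^2:", round(r2_score(y_te, ridge.predict(X_te)), 3))
print("Nonzero LASSO coefficients:", int(np.sum(lasso.coef_ != 0)), "of", p)
```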

The researchers’ findings highlight that in scenarios where all signals are weak, Ridge regression delivers more accurate predictions than models such as LASSO that are focused on pruning data sets down to only the strongest signals.

Random forest was the better of the two tree-based methods when signals were weak, outperforming gradient-boosted regression trees. Neural networks, which guard against overfitting by penalizing their weights, performed better when those penalties kept any single part of the model from having too much influence. This shrinkage-style approach worked more effectively than penalties that, as in LASSO, eliminate the influence of many model components entirely.
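The sketch below, again illustrative rather than anything from the paper, extends the same kind of all-weak-signal setup to those nonlinear methods; in scikit-learn, the neural network’s alpha parameter is an L2 (ridge-style) penalty on the weights, the kind of shrinkage described above, and the model settings are arbitrary choices for demonstration.

```python
# Illustrative sketch, not the researchers' implementation: all-weak-signal data
# fed to random forest, gradient-boosted trees, and a small neural network whose
# `alpha` parameter is an L2 (ridge-style) penalty on its weights.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
n, p = 500, 200
beta = rng.normal(0.0, 0.05, size=p)      # all-weak linear signals, as before
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "random forest": RandomForestRegressor(n_estimators=300, random_state=1),
    "gradient boosting": GradientBoostingRegressor(random_state=1),
    "L2-penalized neural net": MLPRegressor(hidden_layer_sizes=(32,), alpha=1.0,
                                            max_iter=2000, random_state=1),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: out-of-sample R^2 = {r2_score(y_te, model.predict(X_te)):.3f}")
```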

The research suggests that in a landscape where the obvious signals have been fully exploited, the real advantage lies in uncovering and utilizing the subtle, often overlooked patterns within the data. Shen and Xiu’s work finds that by embracing weak signals, researchers and practitioners alike can gain a more nuanced and comprehensive understanding of economic dynamics. Finding the appropriate ML method for a data set is a gateway to recognizing the hidden value within seemingly inconsequential data points.
