There are data about practically everything these days, and they can be used to try to answer any number of questions. Do clinical trials really show a drug works? Can surveys really signal who’s going to win the next election? Can a financial manager really predict a winning portfolio?
As powerful as data are, adjustments made for missing information—the people who drop out of drug trials, the questions people don’t answer in polling, incomplete corporate financial reports—may dramatically skew the results of predictive models, according to University of Bonn’s Joachim Freyberger and Björn Höppner (a PhD student), Washington University in St. Louis’s Andreas Neuhierl, and Chicago Booth’s Michael Weber.
They propose an improved method for handling missing data and tested it against two popular existing approaches in a practical application: predicting stock returns. The results indicate that their method provides a consistent edge.
To compare the three methods, the researchers obtained a database of US stock and balance-sheet data from 1978 to 2021. The data set started out with 2.4 million observations, or rows, each with 82 variables covering trading volume, accounting information, momentum indicators, and the like. As is the case with many data sets, it wasn’t complete: some rows didn’t have values for all 82 variables.
The first widely used method, the “complete cases” approach, drops every observation with any missing value, although this violates a cardinal rule of data analysis: “Thou shalt not throw data away.” For example, if a stock was missing trading volume for one month, all of the data collected for that stock that month had to be dropped. After the researchers applied this rule, just 10 percent of the data remained, even though most of the dropped rows were missing values for only five variables or fewer.
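In code, the complete-cases approach amounts to deleting any row with a missing entry. A minimal sketch in Python with pandas, using a hypothetical toy panel (the column names are illustrative, not the researchers’ actual 82 variables):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly stock observations; NaN marks a missing value.
df = pd.DataFrame({
    "ticker":         ["AAA", "BBB", "CCC", "DDD"],
    "volume":         [1.2e6, np.nan, 3.4e6, 2.1e6],
    "momentum":       [0.05, 0.02, np.nan, 0.01],
    "book_to_market": [0.80, 1.10, 0.90, np.nan],
})

# "Complete cases": drop every row that has at least one missing value.
complete = df.dropna()

# Three of the four rows are each missing a single variable, yet all
# three are discarded wholesale.
print(len(complete))  # → 1
```

The example mirrors the article’s point: even rows missing only one of many variables are thrown out entirely, which is how a large data set can shrink to a small fraction of its original size.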
The other well-known method, “mean imputation,” keeps all of the observations but creates biases. It replaces missing values with an average of all the data set’s existing data points for a given variable and month. But the missing data might include extreme values that could make a significant difference in prediction models. For example, say there’s a housing database, but most of the high-end houses in it are sold by a realtor who never lists the square footage. If analysts replaced the missing data with the average square footage of all housing, they would most likely undershoot and skew their model’s predictions of market values.
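The square-footage example can be sketched the same way. This is a hedged illustration, not the researchers’ implementation: a single column stands in for their per-variable, per-month averaging, and the numbers are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical housing data: the high-end listings are missing
# square footage, echoing the article's realtor example.
df = pd.DataFrame({
    "price": [250_000, 300_000, 275_000, 2_000_000, 3_500_000],
    "sqft":  [1_400.0, 1_800.0, 1_600.0, np.nan, np.nan],
})

# Mean imputation: fill each missing value with the average of the
# observed values for that variable.
df["sqft"] = df["sqft"].fillna(df["sqft"].mean())

# Both multimillion-dollar homes are now recorded at 1,600 sq ft,
# the average of the modest listings, so any price model fit on this
# data will badly understate how large expensive houses tend to be.
print(df["sqft"].tolist())  # → [1400.0, 1800.0, 1600.0, 1600.0, 1600.0]
```

Because the missingness here is correlated with price, the imputed values are biased low, which is exactly the skew the article describes.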