Dacheng Xiu, Professor of Econometrics and Statistics

Advances in computing power have made it increasingly practicable to exploit large and often unstructured data sources such as text, audio, and video for scientific analysis. In the social sciences, textual data is the fastest-growing data form in academic research. The numerical representation of text as data for statistical analysis is, in principle, ultra-high dimensional, so empirical research seeking to exploit its potential richness must confront this dimensionality challenge. Machine learning offers a toolkit for tackling the high-dimensional statistical problem of extracting meaning from text for explanatory and predictive analysis. While the natural language processing and machine learning literature has grown increasingly sophisticated in its ability to model the subtle and complex nature of verbal communication, the use of textual analysis in empirical finance is in its infancy.

Text has most commonly been used in finance to study the “sentiment” of a given document, and this sentiment has most frequently been measured by weighting terms according to a pre-specified sentiment dictionary (e.g., the Harvard-IV psychosocial dictionary) and summing these weights into document-level sentiment scores. Document sentiment scores are then used in a secondary statistical model to investigate a financial research question such as “how do asset returns associate with media sentiment?”

The goal of this project is to demonstrate how machine learning techniques can be used to understand a text corpus above and beyond sentiment, without relying on pre-existing dictionaries that were originally designed for different purposes. We implement and compare a variety of text mining techniques, from the simple “white box” machine learning benchmark we recently proposed to state-of-the-art deep-learning-based “black box” tools from the machine learning literature, such as word2vec and BERT, to measure the sentiment, relevance, and novelty of news, and to explore news in less explicit contexts. We relate this analysis of news to global and individual equity returns in order to assess the economic value of news analytics. A central hurdle to testing theories of information economics is the difficulty of quantifying information. This research agenda designs and compares sophisticated machine learning toolkits for measuring the information content of text documents, opening new lines of research into empirical information economics.
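To make the dictionary-based baseline described above concrete, the following is a minimal sketch of term-weighted sentiment scoring. The tiny positive/negative word lists, the ±1 weights, and the length normalization are illustrative assumptions standing in for a real lexicon such as Harvard-IV; they are not the project's actual method.

```python
# Minimal sketch of dictionary-based document sentiment scoring:
# weight terms by a pre-specified sentiment dictionary and sum the
# weights into a document-level score. The word lists below are
# hypothetical stand-ins, NOT the actual Harvard-IV dictionary.

import re
from collections import Counter

# Hypothetical sentiment dictionary (illustrative only).
POSITIVE = {"gain", "growth", "strong", "profit", "beat"}
NEGATIVE = {"loss", "decline", "weak", "miss", "lawsuit"}

def sentiment_score(document: str) -> float:
    """Sum +1/-1 dictionary weights over tokens, normalized by length."""
    tokens = re.findall(r"[a-z]+", document.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    positive = sum(counts[w] for w in POSITIVE)
    negative = sum(counts[w] for w in NEGATIVE)
    return (positive - negative) / len(tokens)

if __name__ == "__main__":
    news = [
        "Quarterly profit beat expectations on strong revenue growth.",
        "Shares fell after the company reported a loss and a pending lawsuit.",
    ]
    for doc in news:
        print(f"{sentiment_score(doc):+.3f}  {doc}")
```

Scores produced this way are typically fed into a secondary model (e.g., a regression of returns on sentiment); the project's contribution is to move beyond such fixed-dictionary scores toward representations learned from the corpus itself.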

Read working paper (SSRN)