Dacheng Xiu, Professor of Econometrics and Statistics

Advances in computing power have made it increasingly practicable to exploit large and often unstructured data sources such as text, audio, and video for scientific analysis. In the social sciences, textual data is the fastest-growing data form in academic research. The numerical representation of text as data for statistical analysis is, in principle, ultra-high dimensional, so empirical research seeking to exploit its potential richness must confront this dimensionality challenge. Machine learning offers a toolkit for tackling the high-dimensional statistical problem of extracting meaning from text for explanatory and predictive analysis. While the natural language processing and machine learning literature has grown increasingly sophisticated in its ability to model the subtle and complex nature of verbal communication, the use of textual analysis in empirical finance is in its infancy.

Text has most commonly been used in finance to study the “sentiment” of a given document, and this sentiment has most frequently been measured by weighting terms according to a pre-specified sentiment dictionary (e.g., the Harvard-IV psychosocial dictionary) and summing these weights into document-level sentiment scores. Document sentiment scores are then used in a secondary statistical model to investigate a financial research question such as “how do asset returns associate with media sentiment?”

The goal of this project is to demonstrate how machine learning techniques can be used to understand a text corpus above and beyond sentiment, without relying on pre-existing dictionaries that were originally designed for different purposes. We implement and compare a variety of text mining techniques, from the simple “white box” machine learning benchmark we recently proposed to state-of-the-art deep-learning-based “black box” tools from the machine learning literature, such as word2vec and BERT, to measure the sentiment, relevance, and novelty of news, and to explore news in less explicit contexts. We relate this analysis of news to global and individual equity returns in order to assess the economic value of news analytics. A central hurdle to testing theories of information economics is the difficulty of quantifying information. This research agenda designs and compares sophisticated machine learning toolkits for measuring the information content of text documents, opening new lines of research into empirical information economics.
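To make the dictionary-based baseline described above concrete, the following is a minimal sketch of term-weighted sentiment scoring. The tiny positive/negative word lists, the ±1 weights, and the length normalization are illustrative assumptions standing in for a real lexicon such as Harvard-IV; they are not the project's actual method.

```python
# Minimal sketch of dictionary-based document sentiment scoring:
# weight terms by a pre-specified sentiment dictionary and sum the
# weights into a document-level score. The word lists below are
# hypothetical stand-ins, NOT the actual Harvard-IV dictionary.

import re
from collections import Counter

# Hypothetical sentiment dictionary (illustrative only).
POSITIVE = {"gain", "growth", "strong", "profit", "beat"}
NEGATIVE = {"loss", "decline", "weak", "miss", "lawsuit"}

def sentiment_score(document: str) -> float:
    """Sum +1/-1 dictionary weights over tokens, normalized by length."""
    tokens = re.findall(r"[a-z]+", document.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    positive = sum(counts[w] for w in POSITIVE)
    negative = sum(counts[w] for w in NEGATIVE)
    return (positive - negative) / len(tokens)

if __name__ == "__main__":
    news = [
        "Quarterly profit beat expectations on strong revenue growth.",
        "Shares fell after the company reported a loss and a pending lawsuit.",
    ]
    for doc in news:
        print(f"{sentiment_score(doc):+.3f}  {doc}")
```

Scores produced this way are typically fed into a secondary model (e.g., a regression of returns on sentiment); the project's contribution is to move beyond such fixed-dictionary scores toward representations learned from the corpus itself.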

Read working paper (SSRN)