Would you trust your financial adviser if you found out that her training included reading Reddit posts, X threads, and tabloid headlines? That’s essentially what we’re doing when placing trust in financial models built using large language models such as OpenAI’s GPT-4 and Anthropic’s Claude 3.5 Sonnet. These LLMs are typically pretrained on big, diverse datasets such as Common Crawl, a web-scale archive of billions of scraped pages—including all their misinformation and bias. While these models undergo further training to learn to generate the kinds of responses that humans want, the vast majority of the training data are still sourced from the internet.
Chicago Booth research professional Siyan Wang and Booth’s Bradford Levy devised a way to reduce bias and ground LLMs in the content relevant to investment decisions. They have created BeanCounter, a massive business-oriented dataset drawn from company filings and corporate disclosures—documents that businesses are legally required to keep accurate. These documents are subject to regulatory scrutiny, and the companies that file them can face legal consequences for misinformation. The researchers argue that this makes them a more reliable foundation for AI training, resulting in financial LLMs that are more accurate and less likely to output toxic content.
The researchers note that to the best of their knowledge, BeanCounter is the biggest, most comprehensive compilation of business-oriented text. Datasets are often measured in tokens, which are the pieces of words and phrases from which models learn. BeanCounter includes 159 billion tokens. By contrast, while the exact number of tokens in Common Crawl varies, its raw data include at least 100 trillion tokens, which reduces to less than 15 trillion tokens after processing to eliminate duplication.
To create BeanCounter, the researchers gathered millions of public filings from the Securities and Exchange Commission’s EDGAR database, which contains documents dating back to 1996. These include annual reports (10-Ks), quarterly filings (10-Qs), and all the other reports that companies must submit to meet their disclosure obligations.
The researchers extracted and cleaned the text, removing duplicates and irrelevant content. This created a dataset providing a vast source of professionally written, factual content that has largely been untapped for AI training, according to the researchers, who write that less than 0.1 percent of BeanCounter’s content appears in existing Common Crawl–based datasets.
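The article doesn’t detail Wang and Levy’s cleaning pipeline, but the deduplication step they describe can be sketched in miniature. This is an illustrative hash-based exact-duplicate filter (after light normalization), not the researchers’ actual method; the sample filing snippets are invented:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document that is identical after
    normalization -- a stand-in for the more sophisticated near-duplicate
    filtering used in real dataset pipelines."""
    seen: set[str] = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    "Item 1A. Risk Factors: Our revenue depends on market conditions.",
    "Item 1A.  Risk Factors: our revenue depends on market conditions.",
    "Item 7. Management's Discussion and Analysis of Financial Condition.",
]
print(len(deduplicate(docs)))  # the repeated boilerplate collapses to one copy
```

Regulatory filings are full of boilerplate repeated across quarters and companies, which is why a dedup pass matters before counting tokens.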
They have made BeanCounter publicly available through the Hugging Face Hub, an online platform for sharing and collaborating on machine-learning models, datasets, and tools. This allows other researchers and organizations to use BeanCounter to further work towards creating safer AI systems.
To measure the value the new dataset can bring to companies, Wang and Levy conducted finance-specific experiments using two tasks: named entity recognition and Financial PhraseBank. NER is a technique for identifying and categorizing key information including names, dates, and locations in text—for example, discerning when the word apple is referring to the technology company Apple and identifying Cupertino as the city where its headquarters is located. Financial PhraseBank is a sentiment classification task involving nearly 5,000 sentences from financial news, all labeled as “positive,” “neutral,” or “negative” for their likely effect on a stock’s price.
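A sentiment benchmark like Financial PhraseBank boils down to scoring a classifier’s label predictions against human annotations. The toy example below shows the shape of that evaluation; the labeled sentences and the lexicon-based classifier are invented stand-ins for the real benchmark and for an LLM under test:

```python
# Hypothetical mini-benchmark in the style of Financial PhraseBank:
# sentences labeled by their likely effect on a stock's price.
LABELED = [
    ("Operating profit rose sharply from the previous quarter.", "positive"),
    ("The company reported results in line with expectations.", "neutral"),
    ("Net sales decreased due to weaker demand.", "negative"),
    ("Earnings per share fell well short of guidance.", "negative"),
]

POSITIVE = {"rose", "increased", "grew", "gained"}
NEGATIVE = {"decreased", "fell", "declined", "short"}

def classify(sentence: str) -> str:
    """Toy lexicon classifier standing in for the model being benchmarked."""
    words = set(sentence.lower().replace(".", "").split())
    if words & POSITIVE:
        return "positive"
    if words & NEGATIVE:
        return "negative"
    return "neutral"

accuracy = sum(classify(s) == label for s, label in LABELED) / len(LABELED)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 1.00
```

The real benchmark works the same way at a larger scale: nearly 5,000 annotated sentences, with model accuracy reported against the human labels.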
The researchers conducted continued pretraining (extending a model’s initial training by using domain-specific data to improve performance in a task) on two existing small AI models—Pythia-1.4B from the nonprofit research group EleutherAI and Phi-1.5 from Microsoft—using BeanCounter’s data. They then compared the performance of those continually pretrained models against the original versions that had not been exposed to the specialized dataset. The results were striking: Models continually pretrained on BeanCounter showed an 18–33 percent reduction in toxic content generation while at the same time improving their performance for NER and Financial PhraseBank by up to about 4 percent.
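Real continued pretraining means updating a neural model’s weights with further gradient descent on domain text; the count-based toy below only illustrates the underlying intuition, which is that additional training on domain text makes domain text less surprising (lower perplexity) to the model. All corpora here are invented:

```python
import math
from collections import defaultdict

class BigramLM:
    """Tiny count-based bigram model. 'Continued pretraining' here just
    means adding counts from a second, domain-specific corpus."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, corpus: list[str]):
        for sent in corpus:
            tokens = ["<s>"] + sent.lower().split()
            for a, b in zip(tokens, tokens[1:]):
                self.counts[a][b] += 1

    def perplexity(self, sent: str) -> float:
        tokens = ["<s>"] + sent.lower().split()
        logp = 0.0
        for a, b in zip(tokens, tokens[1:]):
            total = sum(self.counts[a].values())
            # add-one smoothing over a toy fixed vocabulary size of 50
            p = (self.counts[a][b] + 1) / (total + 50)
            logp += math.log(p)
        return math.exp(-logp / (len(tokens) - 1))

general = ["the cat sat on the mat", "i love this video"]   # "web" text
domain = ["net revenue increased year over year",           # "filings" text
          "net revenue decreased year over year"]

lm = BigramLM()
lm.train(general)                 # initial pretraining on general text
before = lm.perplexity("net revenue increased year over year")
lm.train(domain)                  # continued pretraining on domain text
after = lm.perplexity("net revenue increased year over year")
print(before > after)             # domain perplexity drops
```

The neural version of this—further gradient steps on BeanCounter text starting from the released Pythia or Phi checkpoints—follows the same logic, just with billions of parameters instead of a count table.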
Wang and Levy also conducted an analysis of how demographic groups are represented in their dataset. They find that BeanCounter’s business documents mentioned demographic groups at similar rates to Common Crawl but did so in less toxic ways. For instance, when the word Asian appeared in BeanCounter documents, the surrounding text was about 72 percent less toxic, on average, than in internet content. (To measure toxicity, they relied on Perspective, a state-of-the-art classifier for detecting toxic language.) This pattern held true across nearly all the demographic descriptors they examined. Potentially sensitive topics appear to be discussed in more professional and measured ways in BeanCounter’s source material, the researchers write.
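Mechanically, this kind of analysis extracts the text surrounding each mention of a demographic term and scores it with a toxicity classifier. The sketch below assumes a made-up keyword-based scorer standing in for the Perspective API (which returns a toxicity probability between 0 and 1 for a span of text); the filing sentence is also invented:

```python
import re

def context_windows(text: str, term: str, width: int = 5) -> list[str]:
    """Return `width` words on each side of every occurrence of `term`."""
    words = text.lower().split()
    return [
        " ".join(words[max(0, i - width): i + width + 1])
        for i, w in enumerate(words)
        if re.sub(r"\W", "", w) == term
    ]

def toxicity_score(span: str) -> float:
    """Placeholder scorer standing in for a real classifier like Perspective."""
    toxic_markers = {"hate", "stupid", "worthless"}
    words = span.split()
    return sum(w in toxic_markers for w in words) / len(words)

filing = "Our Asian operations contributed strongly to segment revenue this year."
windows = context_windows(filing, "asian")
avg = sum(map(toxicity_score, windows)) / len(windows)
print(f"average context toxicity: {avg:.2f}")
```

Averaging these scores per demographic term in BeanCounter versus in a web corpus yields the kind of comparison the researchers report, such as the roughly 72 percent lower toxicity around the word Asian.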
BeanCounter can be used to complement existing data sources, Levy explains. It’s big enough that, on its own, it could pretrain a model such as OpenAI’s GPT-4o mini. And while it’s too small to pretrain, say, Meta’s largest 405-billion-parameter Llama models, it could be helpful as part of the “annealing” stage of pretraining during which Meta taps high-quality data from lengthy documents to improve its models’ performance.
BeanCounter can also evaluate LLMs, Levy says. Its data are grounded in facts and have associated time stamps, so BeanCounter can assess whether a model or AI system provides answers that are not just accurate but were correct and relevant at a particular point in time.
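Because each filing carries a date, one can define the ground truth for a question as whatever was on the public record at a chosen moment. The sketch below uses invented facts and field names (not BeanCounter’s actual schema) to show the point-in-time lookup such an evaluation would rely on:

```python
from datetime import date

# Hypothetical time-stamped facts, as might be extracted from dated filings.
facts = [
    {"filed": date(2019, 2, 14), "company": "ExampleCo", "ceo": "A. Smith"},
    {"filed": date(2022, 3, 1),  "company": "ExampleCo", "ceo": "B. Jones"},
]

def answer_as_of(facts, company: str, as_of: date):
    """Return the most recently filed value already on record at `as_of`,
    i.e., the ground truth for a point-in-time evaluation question."""
    known = [f for f in facts if f["company"] == company and f["filed"] <= as_of]
    if not known:
        return None
    return max(known, key=lambda f: f["filed"])["ceo"]

print(answer_as_of(facts, "ExampleCo", date(2020, 6, 30)))  # A. Smith
print(answer_as_of(facts, "ExampleCo", date(2023, 1, 1)))   # B. Jones
```

Asking a model "Who was ExampleCo's CEO?" with different cutoff dates, and checking each answer against this lookup, tests whether its knowledge is correct for a particular point in time rather than merely correct today.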
The importance of high-quality, reliable training data will almost certainly grow as AI systems become increasingly integrated into decision-making across industries. BeanCounter demonstrates that carefully curated, domain-specific datasets can lead to AI models that are both more capable and more ethically aligned, write Wang and Levy. This suggests a potential pathway for developing specialized AI systems in other professional fields, such as law or medicine, where accuracy and professional conduct are paramount.
The researchers envision a future where AI systems could learn from professional rather than social sources and deliver more reliable and unbiased insights while being more efficient and economical than their larger, general-purpose counterparts—kind of like getting investment advice from a financial adviser instead of relying on the Twitterverse.
Siyan Wang and Bradford Levy, “BeanCounter: A Low-Toxicity, Large-Scale, and Open Dataset of Business-Oriented Text,” Proceedings of the Thirty-Eighth Conference on Neural Information Processing Systems, December 2024.