Would you trust your financial adviser if you found out that her training included reading Reddit posts, X threads, and tabloid headlines? That’s essentially what we’re doing when placing trust in financial models built using large language models such as OpenAI’s GPT-4 and Anthropic’s Claude 3.5 Sonnet. These LLMs are typically pretrained on big, diverse datasets such as Common Crawl, a web-scale archive of billions of scraped pages—including all their misinformation and bias. While these models undergo further training to learn to generate the kinds of responses that humans want, the vast majority of the training data are still sourced from the internet.
Chicago Booth research professional Siyan Wang and Booth’s Bradford Levy devised a way to reduce bias and ground LLMs in the content relevant to investment decisions. They have created BeanCounter, a massive business-oriented dataset drawn from company filings and corporate disclosures—documents that businesses are legally required to keep accurate. These documents are subject to regulatory scrutiny, and the companies that file them can face legal consequences for misinformation. The researchers argue that this makes them a more reliable foundation for AI training, resulting in financial LLMs that are more accurate and less likely to output toxic content.
The researchers note that to the best of their knowledge, BeanCounter is the biggest, most comprehensive compilation of business-oriented text. Datasets are often measured in tokens, which are the pieces of words and phrases from which models learn. BeanCounter includes 159 billion tokens. By contrast, while the exact number of tokens in Common Crawl varies, its raw data include at least 100 trillion tokens, which reduces to less than 15 trillion tokens after processing to eliminate duplication.
To create BeanCounter, the researchers gathered millions of public filings from the Securities and Exchange Commission’s EDGAR database, which contains documents dating back to 1996. These include annual reports (10-Ks), quarterly filings (10-Qs), and all the other reports that companies must submit to meet their disclosure obligations.
The researchers extracted and cleaned the text, removing duplicates and irrelevant content. This created a dataset providing a vast source of professionally written, factual content that has largely been untapped for AI training, according to the researchers, who write that less than 0.1 percent of BeanCounter’s content appears in existing Common Crawl–based datasets.
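The article doesn’t detail Wang and Levy’s cleaning pipeline, but the deduplication step they describe can be sketched in miniature. This is an illustrative hash-based exact-duplicate filter (after light normalization), not the researchers’ actual method; the sample filing snippets are invented:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document that is identical after
    normalization -- a stand-in for the more sophisticated near-duplicate
    filtering used in real dataset pipelines."""
    seen: set[str] = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    "Item 1A. Risk Factors: Our revenue depends on market conditions.",
    "Item 1A.  Risk Factors: our revenue depends on market conditions.",
    "Item 7. Management's Discussion and Analysis of Financial Condition.",
]
print(len(deduplicate(docs)))  # the repeated boilerplate collapses to one copy
```

Regulatory filings are full of boilerplate repeated across quarters and companies, which is why a dedup pass matters before counting tokens.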
They have made BeanCounter publicly available through the Hugging Face Hub, an online platform for sharing and collaborating on machine-learning models, datasets, and tools. This allows other researchers and organizations to use BeanCounter to further work towards creating safer AI systems.
To measure the value the new dataset can bring to companies, Wang and Levy conducted finance-specific experiments using two tasks: named entity recognition and Financial PhraseBank. NER is a technique for identifying and categorizing key information including names, dates, and locations in text—for example, discerning when the word apple is referring to the technology company Apple and identifying Cupertino as the city where its headquarters is located. Financial PhraseBank is a sentiment classification task involving nearly 5,000 sentences from financial news, all labeled as “positive,” “neutral,” or “negative” for their likely effect on a stock’s price.
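A sentiment benchmark like Financial PhraseBank boils down to scoring a classifier’s label predictions against human annotations. The toy example below shows the shape of that evaluation; the labeled sentences and the lexicon-based classifier are invented stand-ins for the real benchmark and for an LLM under test:

```python
# Hypothetical mini-benchmark in the style of Financial PhraseBank:
# sentences labeled by their likely effect on a stock's price.
LABELED = [
    ("Operating profit rose sharply from the previous quarter.", "positive"),
    ("The company reported results in line with expectations.", "neutral"),
    ("Net sales decreased due to weaker demand.", "negative"),
    ("Earnings per share fell well short of guidance.", "negative"),
]

POSITIVE = {"rose", "increased", "grew", "gained"}
NEGATIVE = {"decreased", "fell", "declined", "short"}

def classify(sentence: str) -> str:
    """Toy lexicon classifier standing in for the model being benchmarked."""
    words = set(sentence.lower().replace(".", "").split())
    if words & POSITIVE:
        return "positive"
    if words & NEGATIVE:
        return "negative"
    return "neutral"

accuracy = sum(classify(s) == label for s, label in LABELED) / len(LABELED)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 1.00
```

The real benchmark works the same way at a larger scale: nearly 5,000 annotated sentences, with model accuracy reported against the human labels.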
The researchers conducted continued pretraining (extending a model’s initial training by using domain-specific data to improve performance in a task) on two existing small AI models—Pythia-1.4B from the nonprofit research group EleutherAI and Phi-1.5 from Microsoft—using BeanCounter’s data. They then compared the performance of those continually pretrained models against the original versions that had not been exposed to the specialized dataset. The results were striking: Models continually pretrained on BeanCounter showed an 18–33 percent reduction in toxic content generation while at the same time improving their performance for NER and Financial PhraseBank by up to about 4 percent.
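Real continued pretraining means updating a neural model’s weights with further gradient descent on domain text; the count-based toy below only illustrates the underlying intuition, which is that additional training on domain text makes domain text less surprising (lower perplexity) to the model. All corpora here are invented:

```python
import math
from collections import defaultdict

class BigramLM:
    """Tiny count-based bigram model. 'Continued pretraining' here just
    means adding counts from a second, domain-specific corpus."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, corpus: list[str]):
        for sent in corpus:
            tokens = ["<s>"] + sent.lower().split()
            for a, b in zip(tokens, tokens[1:]):
                self.counts[a][b] += 1

    def perplexity(self, sent: str) -> float:
        tokens = ["<s>"] + sent.lower().split()
        logp = 0.0
        for a, b in zip(tokens, tokens[1:]):
            total = sum(self.counts[a].values())
            # add-one smoothing over a toy fixed vocabulary size of 50
            p = (self.counts[a][b] + 1) / (total + 50)
            logp += math.log(p)
        return math.exp(-logp / (len(tokens) - 1))

general = ["the cat sat on the mat", "i love this video"]   # "web" text
domain = ["net revenue increased year over year",           # "filings" text
          "net revenue decreased year over year"]

lm = BigramLM()
lm.train(general)                 # initial pretraining on general text
before = lm.perplexity("net revenue increased year over year")
lm.train(domain)                  # continued pretraining on domain text
after = lm.perplexity("net revenue increased year over year")
print(before > after)             # domain perplexity drops
```

The neural version of this—further gradient steps on BeanCounter text starting from the released Pythia or Phi checkpoints—follows the same logic, just with billions of parameters instead of a count table.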
Wang and Levy also conducted an analysis of how demographic groups are represented in their dataset. They find that BeanCounter’s business documents mentioned demographic groups at similar rates to Common Crawl but did so in less toxic ways. For instance, when the word Asian appeared in BeanCounter documents, the surrounding text was about 72 percent less toxic, on average, than in internet content. (To measure toxicity, they relied on Perspective, a state-of-the-art classifier for detecting toxic language.) This pattern held true across nearly all the demographic descriptors they examined. Potentially sensitive topics appear to be discussed in more professional and measured ways in BeanCounter’s source material, the researchers write.
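Mechanically, this kind of analysis extracts the text surrounding each mention of a demographic term and scores it with a toxicity classifier. The sketch below assumes a made-up keyword-based scorer standing in for the Perspective API (which returns a toxicity probability between 0 and 1 for a span of text); the filing sentence is also invented:

```python
import re

def context_windows(text: str, term: str, width: int = 5) -> list[str]:
    """Return `width` words on each side of every occurrence of `term`."""
    words = text.lower().split()
    return [
        " ".join(words[max(0, i - width): i + width + 1])
        for i, w in enumerate(words)
        if re.sub(r"\W", "", w) == term
    ]

def toxicity_score(span: str) -> float:
    """Placeholder scorer standing in for a real classifier like Perspective."""
    toxic_markers = {"hate", "stupid", "worthless"}
    words = span.split()
    return sum(w in toxic_markers for w in words) / len(words)

filing = "Our Asian operations contributed strongly to segment revenue this year."
windows = context_windows(filing, "asian")
avg = sum(map(toxicity_score, windows)) / len(windows)
print(f"average context toxicity: {avg:.2f}")
```

Averaging these scores per demographic term in BeanCounter versus in a web corpus yields the kind of comparison the researchers report, such as the roughly 72 percent lower toxicity around the word Asian.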
BeanCounter can be used to complement existing data sources, Levy explains. It’s big enough that, on its own, it could pretrain a model such as OpenAI’s GPT-4o mini. And while it’s too small to pretrain, say, Meta’s largest 405-billion-parameter Llama models, it could be helpful as part of the “annealing” stage of pretraining during which Meta taps high-quality data from lengthy documents to improve its models’ performance.
BeanCounter can also evaluate LLMs, Levy says. Its data are grounded in facts and have associated time stamps, so BeanCounter can assess whether a model or AI system provides answers that are not just accurate but were correct and relevant at a particular point in time.
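Because each filing carries a date, one can define the ground truth for a question as whatever was on the public record at a chosen moment. The sketch below uses invented facts and field names (not BeanCounter’s actual schema) to show the point-in-time lookup such an evaluation would rely on:

```python
from datetime import date

# Hypothetical time-stamped facts, as might be extracted from dated filings.
facts = [
    {"filed": date(2019, 2, 14), "company": "ExampleCo", "ceo": "A. Smith"},
    {"filed": date(2022, 3, 1),  "company": "ExampleCo", "ceo": "B. Jones"},
]

def answer_as_of(facts, company: str, as_of: date):
    """Return the most recently filed value already on record at `as_of`,
    i.e., the ground truth for a point-in-time evaluation question."""
    known = [f for f in facts if f["company"] == company and f["filed"] <= as_of]
    if not known:
        return None
    return max(known, key=lambda f: f["filed"])["ceo"]

print(answer_as_of(facts, "ExampleCo", date(2020, 6, 30)))  # A. Smith
print(answer_as_of(facts, "ExampleCo", date(2023, 1, 1)))   # B. Jones
```

Asking a model "Who was ExampleCo's CEO?" with different cutoff dates, and checking each answer against this lookup, tests whether its knowledge is correct for a particular point in time rather than merely correct today.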
The importance of high-quality, reliable training data will almost certainly grow as AI systems become increasingly integrated into decision-making across industries. BeanCounter demonstrates that carefully curated, domain-specific datasets can lead to AI models that are both more capable and more ethically aligned, write Wang and Levy. This suggests a potential pathway for developing specialized AI systems in other professional fields, such as law or medicine, where accuracy and professional conduct are paramount.
The researchers envision a future where AI systems could learn from professional rather than social sources and deliver more reliable and unbiased insights while being more efficient and economical than their larger, general-purpose counterparts—kind of like getting investment advice from a financial adviser instead of relying on the Twitterverse.
Siyan Wang and Bradford Levy, “BeanCounter: A Low-Toxicity, Large-Scale, and Open Dataset of Business-Oriented Text,” Proceedings of the Thirty-Eighth Conference on Neural Information Processing Systems, December 2024.