Is ChatGPT Confident About Its Answer or Just Bluffing?
A meter could calibrate the uncertainty of AI’s predictions.
April 10, 2025
Try asking ChatGPT to diagnose why your cat has suddenly stopped eating the expensive canned food she loves right after you bought $100 worth of it. The chatbot might state five possible causes—but ask the same question again, perhaps worded slightly differently, and you might get different answers. This shifting response raises a crucial question: How can we know when to trust predictions made by artificial intelligence?
Chicago Booth research professional Jungeum Kim, University of Chicago PhD student Sean O’Hagan, and Booth’s Veronika Ročková have developed a way to measure and convey how certain or uncertain AI systems are about their answers. Their method, called “conformal tree,” could help end users know when to trust AI’s predictions and when to be more cautious.
Modern AI systems such as ChatGPT are black boxes—we can see what goes into them (our questions) and what comes out (their answers), but we can’t examine the vast amount of data on which the systems were trained or the complex algorithms inside them. This opacity makes it difficult to know how confident we should be in their predictions.
Think of it as getting advice from a very smart friend who can’t always explain how they know what they know. Sometimes they’re completely certain of something, and other times they’re making an educated guess. But how do you know which type of advice they’re giving? With AI, you can’t dig into a model’s training data to find that out.
Instead, the researchers created a system that works like a sophisticated confidence meter wrapped around the AI. The meter calibrates the uncertainty of the AI’s predictions using a smaller, specialized dataset relevant to the specific type of questions being asked.
Specialized datasets are often used to fine-tune AI systems so that they will perform better for a specific task. In this case, however, the method uses the smaller dataset for calibration.
The dataset includes the correct answers to the questions being posed, and the system measures how far the AI’s predictions deviate from those correct answers. It groups prompts with similar characteristics so that, within each group, the AI model performs similarly well, or similarly poorly.
For any new question, the system reports a set of plausible answers, tailoring the number of options to the group into which the query falls. It uses the AI’s performance on the specialized dataset to estimate how uncertain the prediction is. Better performance leads to fewer, more precise answers; where the AI performs poorly on the calibration data, uncertainty increases and the set expands to include more options. In doing so, the system effectively communicates that the AI is unsure of the answer and that the user should consider a wider range of possibilities.
The number of options reported also depends on the quality of the prompt: a vague or unclear query can land in a group where the AI performs worse, resulting in more options. Even as it adapts in this way, the system maintains a high, prespecified probability that the set it reports contains the right answer.
The researchers say that the conformal tree’s ability to self-group data points is a key innovation of their method—and its focus on grouping allows more precise measures of how well predictions can be trusted, since some of the data subsets may have more consistent predictions than others. Moreover, the system can then quantify uncertainty for new inputs by leveraging the patterns it has learned from similar cases in the calibration dataset.
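The mechanics described above can be made concrete with a short sketch of group-wise split conformal prediction, the general technique that conformal tree builds on. This is not the authors’ implementation: the fixed two-group split, the nonconformity score (one minus the model’s probability of the correct answer), the miscoverage level alpha, and the toy data are all illustrative assumptions.

```python
# Minimal sketch of group-wise split conformal prediction (illustrative, not the
# paper's conformal tree). Low scores mean the model was confidently correct.
import numpy as np

def calibrate_group_thresholds(cal_scores, cal_groups, alpha=0.1):
    """For each group, find the score cutoff that covers the true answer with
    probability at least 1 - alpha on the calibration data."""
    thresholds = {}
    for g in np.unique(cal_groups):
        scores = np.sort(cal_scores[cal_groups == g])
        n = len(scores)
        # Standard split-conformal quantile: the ceil((n + 1)(1 - alpha))-th
        # smallest calibration score, capped at the largest score.
        k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
        thresholds[g] = scores[k - 1]
    return thresholds

def prediction_set(candidate_scores, group, thresholds):
    """Keep every candidate answer whose score falls under the query group's
    cutoff: reliable groups give small sets, unreliable groups give wide ones."""
    cutoff = thresholds[group]
    return [answer for answer, s in candidate_scores.items() if s <= cutoff]

# Toy calibration data: the score is 1 minus the model's probability of the
# correct answer on questions whose answers are known.
rng = np.random.default_rng(0)
cal_scores = np.concatenate([rng.uniform(0.0, 0.3, 50),   # group 0: model reliable
                             rng.uniform(0.2, 1.0, 50)])  # group 1: model shaky
cal_groups = np.array([0] * 50 + [1] * 50)
thresholds = calibrate_group_thresholds(cal_scores, cal_groups, alpha=0.1)

# A new query that falls into the shaky group gets a wider prediction set.
candidate_scores = {"psoriasis": 0.15, "eczema": 0.40, "rosacea": 0.85}
print(prediction_set(candidate_scores, group=0, thresholds=thresholds))  # small set
print(prediction_set(candidate_scores, group=1, thresholds=thresholds))  # wider set
```

The key difference in the researchers’ method is that the groups are not handed to the system: a tree learns them from the calibration data itself, splitting prompts so that the AI’s accuracy is roughly uniform within each group.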
Kim, O’Hagan, and Ročková tested their approach on two real-world applications using OpenAI’s GPT-4o model. In one, they examined how well GPT-4o could predict which US state a legislator represented on the basis of that person’s political ideology, which they quantified with numerical scores derived from legislators’ voting patterns, obtained from Voteview, a project that calculates and visualizes US politicians’ positions.
Their method proved effective in this exercise—in one example, while traditional approaches suggested the legislator could be from any of 34 states, their conformal tree method narrowed it down to just 19 states, including the correct answer, which was Indiana.
The second application tested the method’s ability to diagnose skin conditions from descriptions of symptoms. The method produced small sets of predicted diagnoses while maintaining reliability: in 96 percent of cases, its prediction sets were the same size as or smaller than those of traditional approaches.
Crucially, when the AI was likely to “hallucinate,” or make unreliable predictions, the system automatically widened its range of possibilities, essentially raising a cautionary flag.
The technique identified combinations of symptoms for which the AI’s diagnoses were relatively accurate, and others for which the AI was completely unreliable. For example, when the symptoms entered in the prompt corresponded to psoriasis, which is a common condition, the AI offered just 1.77 possible diagnoses, on average. But for pityriasis rubra pilaris, a relatively rare disease that is harder to diagnose and that has fewer distinguishing features, its prediction sets contained every possible disease. “We were able to force ChatGPT into saying ‘I do not know,’ which is a rare quality,” explains Ročková.
The implications extend beyond medical diagnosis and political analysis, according to the research. As AI systems become increasingly integrated into our daily lives and decision-making processes, understanding when to trust their predictions becomes crucial. This framework offers a practical way to add responsible uncertainty quantification to existing AI systems, helping users make more informed decisions.
“Our paper is not meant to encourage users to rely on ChatGPT for predictions. It is only meant to illustrate that if one were to do so, proper accounting for uncertainty is needed,” the researchers write. Receiving clear communication about uncertainty, they explain, helps users better understand both the capabilities and limitations of AI predictions.
Jungeum Kim, Sean O’Hagan, and Veronika Ročková, “Adaptive Uncertainty Quantification for Generative AI,” preprint, arXiv:2408.08990v1, August 2024.