When we develop business strategies, we often rely on what we believe has worked in the past. Unfortunately, we’re often wrong.
A better way to make a decision with financial consequences is to conduct an experiment. Obviously, the idea of doing experiments is really old. Think of Galileo in the late 1500s at the top of the Leaning Tower of Pisa with two spheres of different sizes, dropping them off the side to test theories of gravity.
Experiments in the business world have been around for a long time, too, but they weren’t that prevalent in the past. In the 1960s, for example, there was a debate about whether arranging a particular grocery-store item to have a lot of shelf facings—identical products with the label turned out toward the consumer—would cause customers to be more likely to buy the product, or to buy more of it.
That’s hard to tease apart, because things that sell in higher volume are going to get more shelf facings. So Kent State University’s Keith Cox partnered with a grocery chain and, over several weeks and at different locations, randomly varied the number of shelf facings for four products.
He found the evidence was mixed. Increasing shelf facings seemed to cause an uptick in sales of one product, but not of the other three. The results were not necessarily groundbreaking, but the idea of systematically varying things to test these causal effects is the core idea of experimentation.
You can see similar tests happening these days with the aid of technology, which has allowed experiments to explode online, where they’re often referred to as A/B testing. One of the early adopters was Amazon. The company had an internal debate about whether to give you a recommendation for something else to buy when you’re about to check out. At the time, this was controversial. The concern was that if Amazon gave customers a recommendation for another product, they might read about it and then think, “Oh, maybe I should get that too, but I’ll decide later,” and never complete the purchase.
Amazon decided to settle the debate experimentally. When new customers came to the site, they were randomly assigned either to get an extra product recommendation or not. What Amazon found was a big increase in sales when it made the extra recommendation.
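To make the mechanics concrete, here is a minimal sketch of this kind of randomized assignment and comparison. Everything in it is invented for illustration; it has nothing to do with Amazon’s actual systems or data.

```python
# Purely illustrative sketch of a simple A/B test: random assignment,
# then a comparison of average order value between the two groups.
# All numbers (including the assumed "lift") are made up.
import random
from statistics import mean

random.seed(42)

def assign_variant(customer_id: int) -> str:
    """Randomly assign a customer to the control or treatment group."""
    return random.choice(["no_recommendation", "recommendation"])

orders = {"no_recommendation": [], "recommendation": []}
for customer_id in range(10_000):
    variant = assign_variant(customer_id)
    base_spend = random.gauss(50, 15)                # simulated baseline spending
    lift = 5 if variant == "recommendation" else 0   # assumed treatment effect
    orders[variant].append(max(0.0, base_spend + lift))

for variant, values in orders.items():
    print(f"{variant}: n={len(values)}, average order value = ${mean(values):.2f}")
```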
Amazon learned from this experiment not just the answer to one question about recommendations. The company took away something much more profound: it should use experimentation whenever feasible to make these decisions, rather than relying on a manager’s judgment or on a roomful of people debating.
Today, lots of other companies experiment all the time. Capital One supposedly does 80,000 small experiments in a typical year. Eight to 10 years ago, Google was doing 40,000–70,000 experiments a year. Now, we couldn’t even quantify how many experiments it completes, because it has moved to a continuous-experimentation mode. If you’ve been online in the past 24 hours, you probably have been a participant in one of these experiments.
That said, the use of experimentation is unevenly distributed. Companies develop certain ways of making decisions that they trust. If you didn’t come up in an experimental culture, it’s hard to change.
In businesses that don’t treat experimentation as a core value, how are people making decisions, and how do those methods stack up to experimentation?
In a case study, I worked with Indranil Goswami, a graduate of Chicago Booth’s PhD Program, who’s now at the University at Buffalo, to compare sources of information for decision-making in the context of fundraising. Specifically, we looked at matching offers. You’ve probably received an appeal in the mail that says, “This is a great time to donate, because for every dollar you give, we have a sponsor who’s also going to give a dollar.”
For fundraisers who plan to use this kind of matching appeal, there are a lot of decisions to make in the wording. What is the basis for these decisions? We can think about precedent: we did this last year and didn’t get a lot of angry letters, so let’s just do it again. For a more sophisticated approach, we can look to expert intuition. From their experience trying different things over time, professional fundraisers may be learning a lot about donor behavior and feedback.
We also can think about this as a marketing research problem. We could show our proposed fundraising appeal to people in a focus group or do an online survey. Finally, there may be economic models that we can use to predict what would happen.
There have been a number of studies testing matching offers. Some, such as a 2007 study by Northwestern’s Dean Karlan and the University of Chicago’s John A. List, suggested that matching helps raise more funds. But other studies in other settings found no difference, and a few found a slightly lower amount of funds raised when matching was used.
Researchers have proposed that other factors may influence whether the match is successful. The match might be seen as a quality signal: if a sponsor is willing to support this organization and match donations, it must take the organization seriously.
There’s also speculation, though not a lot of evidence, that matching is a social cue. We do some things because we like doing them, but we may also do other things because we like being part of a group of people doing the same thing.
There also could be negative responses among potential donors to a matching appeal. Donors might think that if an organization can get a matching sponsor, it must have lots of ways of getting money, and they should instead give to an organization that needs the money more.
It’s awkward to ask people for money, so fundraisers also worry about whether an appeal is going to sound coercive or offend the recipients in some way.
There could be two additional factors at work. One is a substitution effect, particularly for repeat donors. A donor might think: last time, I gave $40, and you got $40. This time, I could give $20, and you’d still get $40. It’s like a half-off sale: the donor could spend half the money, make the same impact, and pocket the difference.
We also could think about it as a quality or a norm signal, which is that if an organization has to resort to matching, maybe its regular fundraising wasn’t going that well, and others weren’t giving.
This is a complex situation in which to make a decision. All of these interpretations and motives seem plausible, and that makes it hard for the person designing charity appeals to make a confident prediction.
In our research, we partnered with the Hyde Park Art Center in Chicago, for its 75th-anniversary fundraising drive, to come up with different ways to implement a matching campaign. The first change we designed was a framing manipulation: if we’re worried that a matching appeal might be coercive, we could reframe it in nicer terms. There’s something a little strange about a standard matching appeal; it can feel like a wealthy sponsor saying, “I’m only going to donate if you take money out of your own wallet.”
We reframed this as “Let me help you donate more.” Instead of saying that for every dollar you donate, the sponsor will donate a dollar as well, the alternative “giving credit” appeal said that for every dollar you contribute, the sponsor will add a dollar to your donation, helping you give more.
The second proposed idea was a threshold match: if we’re worried about the substitution effect, why not have the match kick in only above a certain point? In this version, the appeal communicated to repeat donors that anything you give above what you donated last time will be matched.
As much as possible, we kept the rest of the wording the same, because we were trying to isolate the effects of these different strategies.
To test the sources of information that decision makers might typically rely on, we looked at published academic research, we surveyed professional fundraisers, and we did a simple, cheap market-research study. We compared these sources with the ground truth of our field experiment with the Hyde Park Art Center, testing these different appeals.
In terms of economic models, we can think about the utility you get from donating money to a charity as having three parts. The first part is the utility you have from everything else in your life that costs money. The more you donate, the lower that gets.
The second part is what’s called pure altruism. I get utility from the fact that a good organization has the funds to run its programs, and every additional dollar it receives, whether from me or from someone else, gives me utility.
The third part is what’s called “warm glow.” I feel good about myself for being a donor, and if I give more, I get to feel even better about myself. And if I don’t give anything, even if the organization has all the money it needs, I’m missing out on this warm glow.
Analyzing these three sources of utility in the standard models of altruism tells us that you’re going to donate up to the point where the value of the incremental dollar you give, through the combination of pure altruism and warm glow, equals the value you would have gotten from spending that dollar on something else instead. If this model of utility is accurate, we can predict that appeals with a match will be more effective than not having a match.
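As a rough formalization, here is my own sketch of a standard impure-altruism model; the notation is mine, not the essay’s. The donor has wealth w, gives g, faces a match rate m, and the charity raises G_0 from everyone else.

```latex
% Sketch of a standard "impure altruism" model; notation is illustrative.
% v(.): utility from everything else money buys, a(.): pure altruism,
% h(.): warm glow, w: wealth, g: own gift, m: match rate, G_0: other funds raised.
U(g) = v(w - g) + a\bigl(G_0 + (1 + m)\,g\bigr) + h(g)

% The donor gives until the marginal benefit of the last dollar
% (altruism plus warm glow) equals its cost in forgone consumption:
(1 + m)\,a'\bigl(G_0 + (1 + m)\,g^{*}\bigr) + h'(g^{*}) = v'(w - g^{*})
```

With m = 1, every dollar given does double duty on the altruism term, which is the sense in which this kind of model predicts a match should help.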
What does the model predict about the effectiveness of our alternative appeals? We hope the giving-credit framing will make it so that the donor is giving $20 but gets to feel good about $40. If people actually interpret it this way, the model makes an unambiguous prediction: the giving-credit match will raise more funds than the regular match.
The threshold match is more complicated. The model doesn’t give us a clear prediction. If you would give at least your prior donation regardless, the threshold match will motivate you to give more. But if you’re not going to give as much as last time, the threshold match provides no matching funds, compared with a full match in the standard appeal, and so it will do worse.
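A small worked example may help here; the function names and dollar figures are mine, echoing the $40 repeat donor discussed earlier, and are not taken from the study. It simply tallies what the charity receives under each scheme.

```python
# Illustrative comparison of a standard dollar-for-dollar match with a
# threshold match that only matches the amount above last year's gift.
# Names and numbers are hypothetical, chosen to mirror the $40 example above.

def standard_match(gift: float) -> float:
    """Charity receives the gift plus a one-for-one match on every dollar."""
    return gift + gift

def threshold_match(gift: float, prior_gift: float) -> float:
    """Charity receives the gift plus a match only on dollars above last year's gift."""
    return gift + max(0.0, gift - prior_gift)

prior = 40  # last year's donation
for gift in (20, 40, 50):
    print(f"gift ${gift}: standard -> ${standard_match(gift):.0f}, "
          f"threshold -> ${threshold_match(gift, prior):.0f}")

# gift $20: standard -> $40, threshold -> $20
# gift $40: standard -> $80, threshold -> $40
# gift $50: standard -> $100, threshold -> $60
```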
We surveyed professional fundraisers with an average of 10 years of experience. We showed them five versions of the appeal: no match at all, the regular match, reframing the match as giving credit, the threshold match with regular wording, and the threshold match reframed as giving credit.
From these five, we asked the fundraisers to compare versions of the appeal two at a time and judge which would be more effective for participation and for average contribution. Overwhelmingly, the professional fundraisers said including a match would be more effective than not having one.
The fundraisers didn’t have much direct experience with the giving-credit framing. But an overwhelming majority said it probably would be better both for participation rates and for average contributions among those who sent in a donation.
The fundraisers were split on whether the threshold match would work well. It could help to deal with the substitution effect, or maybe it would instead demotivate or confuse people.
The last question we posed: If we do the threshold match, should we do the giving-credit framing or the standard framing? The professional fundraisers thought that whether or not we were doing the threshold match, the giving-credit framing was a good idea.
In a second survey with fundraisers, we focused only on comparing the giving-credit framing to the standard appeal. We first showed them one appeal and asked them to evaluate that one version, and then we showed them the other appeal and asked them to evaluate that one on its own.
What we found is that when we first showed fundraisers the standard appeal and then the giving-credit framing, they said giving credit was going to be more effective. But if we started off describing the giving-credit appeal and then described the standard appeal, they rated the two pretty similarly.
This is strange if we think that professional fundraisers are drawing on their wealth of experience and their mental model of the donor. Their responses shouldn’t be sensitive to uninformative factors, such as the order of the comparison. Throughout behavioral science, we see that people’s judgments change with these kinds of factors when they are making up their minds on the spot, rather than relying on preexisting knowledge. This might shake our confidence in the fundraisers.
The last potential source of information is a market-research study. We designed a survey that a charity could implement cheaply and quickly online. First, we had respondents choose their favorite charity from a list of 20. We told participants that five people were going to be chosen to win $20, but they had to decide in advance, if they won, how much they wanted to give to their selected charity. Then we randomly assigned them to one of the five appeals.
In the results, there was no strong evidence of any significant differences. The full match plus giving credit did a little better, but it wasn’t a statistically significant difference compared with the other appeals.
So what actually happened when we ran the field study to test the effect of the appeals? We sent out 1,500 mailers. The donation rate was about 5 percent, and the median donation was $100. In the market research study, the donation rate was instead 75 percent! That’s a clear warning sign that the market-research survey might not have accurately captured the thinking of potential donors.
As it turns out, the giving-credit framing with the threshold match—both of our brilliant ideas combined—actually reduced participation. The threshold match didn’t seem to make a big difference, but the giving-credit framing significantly decreased participation. Overall, there was a negative net effect on how much money was raised, on average, per appeal sent out. I felt very bad that we cost the Hyde Park Art Center money, but our intuitions were no worse than the other sources of information they could have consulted.
We ran the experiment again, in a second fundraising campaign, this time testing only the standard match versus the giving-credit framing. We sent out 3,000 mailers, and 3 percent donated.
The first experiment wasn’t a fluke. The second time, we didn’t see a difference in average contribution, but we saw a huge effect on participation. The giving-credit framing basically cut participation in half, and as a result, net donations were half of what was raised by the standard-match appeal. In fundraising, that’s a massive effect.
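For intuition about how large that gap is relative to the sample, here is a rough sketch of the kind of two-proportion test one might run. The counts are illustrative guesses consistent with the rounded figures above (3,000 mailers, roughly 3 percent responding overall, participation about halved under giving credit); they are not the study’s actual data.

```python
# Illustrative only: assumed counts, not the actual Hyde Park Art Center data.
from statsmodels.stats.proportion import proportions_ztest

responses = [60, 30]      # assumed responders: standard match vs. giving-credit framing
mailers = [1500, 1500]    # assumed even split of the 3,000 mailers

z_stat, p_value = proportions_ztest(count=responses, nobs=mailers)
print(f"standard match: {responses[0]/mailers[0]:.1%}, "
      f"giving credit: {responses[1]/mailers[1]:.1%}, "
      f"z = {z_stat:.2f}, p = {p_value:.4f}")
# With counts like these, a halving of the response rate is very unlikely
# to be chance variation (p is well below 0.05).
```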
So, the result of all this research is not only confirmation that the giving-credit framing is a bad idea, but that it’s a bad idea that we couldn’t really have predicted with the kinds of information available to fundraisers.
You might argue that there’s something unique about fundraising when a field experiment shows terrible results for an idea that experts predicted would be successful. But examples from other contexts are quite consistent with this pattern.
In education, there’s a study looking at an intervention to have parents receive alerts on their phones when their kids miss school or don’t hand in homework. Researchers polled experts for their predictions about which of four strategies would be most effective in getting parents to sign up for the alerts, but few of the experts actually identified the right one.
In health care, there’s been almost a consensus that “hot spotting” is a good idea. The philosophy is that identifying the subset of patients who are responsible for a huge proportion of medical costs and treating them more intensively the first time they show up to the hospital will reduce costs overall.
The basis of this idea came primarily from observational data. But an experiment just published in the New England Journal of Medicine, which randomly assigned high-risk patients either to a hot-spotting intervention or to regular treatment, found absolutely no difference in hospital readmission rates.
I’m not saying we should make all decisions on the basis of some field experiment. In fact, sometimes that may not be enough. There’s a lot of evidence showing that field experiments done in one context at one time with one population vary in how well they generalize to other settings. What I’m arguing for is, when possible, to conduct in-context field experiments.
This leaves us with the probably disappointing advice that I give to people in industry. When they ask me about the big new ideas from academics that they can implement tomorrow, I typically have to say that I don’t know their business well enough to make a confident recommendation about what would work in their context. What I can give them are some good ideas for experiments and advice on how to conduct them.
Oleg Urminsky is professor of marketing at Chicago Booth, and teaches the Experimental Marketing course on how to use experimental methods to make business decisions. This essay is adapted from a lecture given at Booth’s Kilts Center for Marketing in February 2020.