I am working with a team of economists from four other universities to develop indices that describe human movement and social contact in the context of the COVID-19 pandemic.
This team of researchers had been working with smartphone-movement data for a while on unrelated subjects. But when the pandemic started, we realized that we had access to information that describes what people are doing in close to real time and with a lot of spatial precision. That becomes a very valuable data resource when thinking about transmission of the disease between different cities, and about transmission by people encountering each other in bars, restaurants, etc. And so we quickly pivoted to producing indices that would be useful in the context of the ongoing pandemic.
In the last decade or so, the norm in economics has become to share your code and your data after you’ve published your research in an academic journal by uploading that code and data into an academic repository. What we’ve done during this crisis is a little bit different, in the sense that I’ve been sharing code and research output via GitHub, which is a collaborative platform online that’s used by a lot of software developers to work with each other and publish and archive their code.
This process is different in the sense that the usual sharing of replication materials in academic repositories happens after the fact, to show that you’ve done the work and verified the outputs that you’ve already published in an academic journal. What we’re doing is more of a living repository with active collaboration, where outside researchers can examine the work that you’re currently pursuing.
When the pandemic started, we realized that smartphone-movement data could be relevant for learning about a lot of different aspects of the pandemic, and it simply wasn’t going to be possible for us as a team of researchers to explore all those different important questions. And the fact that the pandemic was escalating relatively quickly meant that there was a real sense of urgency: if we were sitting on the data, we would be delaying the ability of other social scientists to tackle these important questions.
So we decided to focus our efforts on producing data that is measured in a consistent and understandable way, because the cell-phone data in its very raw form is billions of pings generated by millions of devices—so the raw data is very messy. And what we’ve done in partnership with our data providers is put a lot of work into cleaning up that data in order to produce consistent measures of behavior.
And then we decided to release those indices to the general research community so that, with those measures in hand, lots of different economists and other social scientists could use them to tackle a variety of questions. That massively multiplies the amount of brain power and the number of work hours dedicated to studying this data, compared to what would have happened if we had tried to just keep it with our team of five economists and a few research assistants.
So for example, [Chicago Booth’s] Brent Neiman and I have recently produced research in which we estimate the fraction of jobs that can be done from the comfort of your own home, which is a question that we hadn’t thought about a lot prior to 2020. But now it’s very salient, in the sense that the ongoing pandemic has forced a lot of people to work from home. And so when we produced our estimate of how many people could work from home, we shared our research note with the research community and, at the same time, put up all the code and the results of running that code: the data that says, here’s our classification of each occupation and whether it can be performed at home or not.
And in the case of this particular research project, we’ve realized the benefits of sharing early. Somebody could find a small typo in our code, for instance. In our case, we had produced information about the fraction of jobs that could be done at home for each city in America, and then a graduate student saw that and wanted to produce a state-level measure in order to look at how it correlated with a bunch of state outcomes, say unemployment insurance claims. And so that graduate student took our code, modified it in order to produce the answer for states instead of cities, and then shared it back with us. And so that is now in the GitHub repository as well.
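As a rough illustration of the kind of modification described here, the sketch below computes an employment-weighted share of jobs that can be done at home, first by city and then by state. It is not the code in the GitHub repository: the records, the occupation classification, and the function name `share_at_home` are all hypothetical, and the numbers are made up for the example.

```python
from collections import defaultdict

# Hypothetical records: (city, state, occupation, employment). Made-up numbers.
EMPLOYMENT = [
    ("City A", "State X", "software developer", 100),
    ("City A", "State X", "line cook", 300),
    ("City B", "State X", "software developer", 50),
    ("City B", "State X", "line cook", 50),
    ("City C", "State Y", "software developer", 200),
    ("City C", "State Y", "line cook", 200),
]

# Hypothetical classification: 1 if the occupation can be done at home, else 0.
CAN_WORK_AT_HOME = {"software developer": 1, "line cook": 0}


def share_at_home(records, classification, level):
    """Return the employment-weighted share of at-home jobs per region.

    `level` is "city" or "state" and selects which column defines a region.
    """
    column = {"city": 0, "state": 1}[level]
    total = defaultdict(float)
    at_home = defaultdict(float)
    for row in records:
        region, occupation, jobs = row[column], row[2], row[3]
        total[region] += jobs
        at_home[region] += jobs * classification[occupation]
    return {region: at_home[region] / total[region] for region in total}


# The same function serves both levels; only the grouping column changes.
by_city = share_at_home(EMPLOYMENT, CAN_WORK_AT_HOME, "city")
by_state = share_at_home(EMPLOYMENT, CAN_WORK_AT_HOME, "state")
```

Under these assumptions, moving from cities to states only changes the grouping column. The occupation classification itself is reused unchanged, which is what makes this kind of extension easy for an outside researcher to contribute.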
So it’s a more collaborative approach, in which you share information in a way that lets other people both inspect the work that you’ve done and potentially contribute, or tell you when they’ve, say, found a shortcoming or an additional extension that they want to pursue. In the context of a global pandemic, it’s particularly valuable to share our findings with the global research community. We have produced those numbers for the United States, but researchers from all over the globe wanted to be able to produce numbers for Malaysia, Australia, the United Kingdom, Germany, South Africa. And sharing our results, even prior to them being published in an academic journal, meant that a lot of information that’s relevant for public policy and for thinking about the economic consequences of the crisis was disseminated very quickly to the global research community.
It’s been great to see a lot of people in the profession, in the context of this pandemic, abandon the narrow-minded, or at least usual, professional incentives to try to maximize the private returns to having access to data. Instead, a lot of people have started contributing data to shared public resources, recognizing that there’s a lot of value in disseminating that information, even if the traditional professional incentives don’t necessarily push in the direction of data sharing.