Social science research with better data

A common problem in economic research is that a crucial variable is either not recorded or not accurately captured. A typical example: when people self-report how much they earned last year, they might forget some sources of income or deliberately leave them out.

About a decade ago, administrative datasets were rarely used in economic research. Of the papers published in a top economics journal (AER) 11 years ago, only 4% relied on administrative data. Now such datasets are about six times more common:

Source: Economics in the age of big data

The benefits are not just large sample sizes but also accuracy and (often) the option to track individuals over time or to observe additional covariates. Assuming that the research design is typically sensible, it seems fair to expect that findings based on such datasets should be more credible than older results.

As Einav and Levin say in a different paper about changes in economic analysis, "because the coverage is “universal,” administrative data sets can be linked to other, potentially more selective, data." (This is great for researchers but not for privacy advocates or privacy-conscious citizens.)

Longer trends

Many topics analyzed with the type of data praised in this post have a "micro" flavor, but I'd say a lot of this work will have macroeconomic implications. Researchers who study how much we are mismeasuring productivity, or how people spend their time (and money), obviously produce inputs relevant for macroeconomists. This is the "micro-data for macroeconomics" research program, which I like a lot.


Noah Smith posted this chart today, which shows that "big data" is used in only a small proportion of NBER papers for now:

The general trend toward identification and breadth in social science seems to be real. That's great news because we can learn whether our theories make sense...


MasterCard markets a product called “SpendingPulse” that provides real-time consumer spending data in different retail categories, and Visa generates periodic reports that successfully predict survey-based outcomes ahead of time. Similarly, Automatic Data Processing (ADP) and Moody’s Analytics release a monthly report on private-sector employment, based on data from the roughly 500,000 firms for which ADP provides payroll software.

These approaches still have some disadvantages relative to government survey measures. Although the underlying data samples are large, they are essentially “convenience samples” and may not be entirely representative. They depend on who has a Visa or MasterCard and decides to use it, or on which firms are using ADP to manage their payroll records. On the other hand, the data are available at high frequency and granularity, and their representativeness could be assessed empirically.
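
The passage above says that representativeness "could be assessed empirically". One simple way to do that, sketched here under assumptions of my own, is to compare the composition of the convenience sample with a benchmark such as a census and then compute post-stratification weights. The function names, the age groups, and all the shares below are hypothetical and invented purely for illustration; they do not come from any of the datasets mentioned.

```python
def representativeness_gap(sample_shares, benchmark_shares):
    """Compare group shares in a sample with a benchmark population.

    Returns the per-group gap (sample minus benchmark) and a dissimilarity
    index: half the sum of absolute gaps, which is 0 for identical
    compositions and 1 for completely disjoint ones.
    """
    gaps = {
        group: sample_shares.get(group, 0.0) - benchmark_share
        for group, benchmark_share in benchmark_shares.items()
    }
    dissimilarity = 0.5 * sum(abs(gap) for gap in gaps.values())
    return gaps, dissimilarity


def poststratification_weights(sample_shares, benchmark_shares):
    """Weight each group by benchmark share divided by sample share.

    Groups absent from the sample are skipped, since no weight can fix
    a group with zero observations.
    """
    return {
        group: benchmark_shares[group] / sample_shares[group]
        for group in benchmark_shares
        if sample_shares.get(group, 0.0) > 0
    }


# Hypothetical shares by age group (made-up numbers, illustration only).
sample = {"18-34": 0.40, "35-54": 0.40, "55+": 0.20}      # e.g. cardholders
benchmark = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}   # e.g. census

gaps, index = representativeness_gap(sample, benchmark)
weights = poststratification_weights(sample, benchmark)

for group in benchmark:
    print(f"{group}: gap {gaps[group]:+.2f}, weight {weights[group]:.3f}")
print(f"dissimilarity index: {index:.2f}")
```

Reweighting of this kind only corrects for the characteristics you can observe in both sources. If cardholders differ from non-cardholders in ways that are not captured by the grouping variable, the weighted sample can still be unrepresentative.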

