Causal Inference on Outcomes Learned from Text

April 11, 2025, 11:00 am to 12:00 pm

Jann Spiess, Stanford University

E18-304

Abstract:

(with Iman Modarressi and Amar Venugopal; arxiv.org/abs/2503.00725)

We propose a machine-learning tool that yields causal inference on text in randomized trials. Based on a simple econometric framework in which text may capture outcomes of interest, our procedure addresses three questions: First, is the text affected by the treatment? Second, which outcomes is the effect on? And third, how complete is our description of causal effects? To answer all three questions, our approach uses large language models (LLMs) that suggest systematic differences across two groups of text documents and then provides valid inference based on costly validation. Specifically, we highlight the need for sample splitting to allow for statistical validation of LLM outputs, as well as the need for human labeling to validate substantive claims about how documents differ across groups. We illustrate the tool in a proof-of-concept application using abstracts of academic manuscripts.
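
The role of sample splitting described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the function names, the 50/50 split, and the permutation test are assumptions chosen for illustration. The idea is that one half of the documents is used to propose a difference between treatment and control texts (for example, by an LLM), while the held-out half receives human labels for that proposed difference and is used for the test, so that the hypothesis is not tested on the data that suggested it.

```python
import random
from statistics import mean


def split_sample(n, seed=0):
    """Randomly split document indices 0..n-1 into a discovery half and a validation half."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    half = n // 2
    return idx[:half], idx[half:]


def permutation_p_value(labels, treated, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in mean labels between
    treated and control documents. `labels` are 0/1 human labels on the
    validation half; `treated` are the matching treatment indicators."""
    rng = random.Random(seed)

    def abs_diff(assignment):
        t = [y for y, d in zip(labels, assignment) if d]
        c = [y for y, d in zip(labels, assignment) if not d]
        return abs(mean(t) - mean(c))

    observed = abs_diff(treated)
    shuffled = list(treated)
    hits = 0
    for _ in range(n_perm):
        # Re-randomize treatment, keeping group sizes fixed.
        rng.shuffle(shuffled)
        if abs_diff(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical workflow: the discovery indices would go to the step that
# proposes a difference; only the validation indices are labeled and tested.
discovery_idx, validation_idx = split_sample(40)
```

In this sketch the test is valid under randomized treatment assignment because the permutation distribution mimics re-randomization; the human labels, rather than model outputs, carry the substantive claim.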

Bio:
Jann is an econometrician in the OIT group at Stanford GSB. His current research focuses broadly on two related themes: (1) high-dimensional and robust causal inference, including work on using machine learning to improve inferences from randomized trials, robust inference in panel data, synthetic control, matching estimation, highly over-parametrized models, and high-dimensional outcome data; and (2) data-driven decision-making with misaligned objectives, including work on algorithmic fairness, human–AI interaction, the regulation of algorithms, and the design of pre-analysis plans. He holds a PhD in economics from Harvard University.
