(Mis)Measuring the Drivers of Ad Performance
Publication Date
9-15-2025
Abstract
We study the potential risks and benefits of using large language model (LLM) annotations in video ad creative research. Using a custom-built, large-scale dataset of over 10,000 human-labeled video ads, we demonstrate that off-the-shelf multimodal LLMs perform poorly when encoding certain types of features. We then show, using ad quality ratings from a large (500+) consumer panel provided by iSpot.tv, that such misaligned measurement can yield downstream effect estimates that are statistically significant in the direction opposite to those inferred from human-labeled data. However, we demonstrate that this bias can be largely mitigated by fine-tuning a model on our large-scale human annotations. The fine-tuned model exceeds average pairwise human agreement on many features and realigns downstream estimates with those based on human annotations. It also substantially improves the explanatory power of labeled content features for ad performance, recovering significant effects that are otherwise missed with human-labeled data because of inter-annotator noise.
Document Type
Article
Keywords
advertising, large language models, multimodal AI, ad creative
Disciplines
Marketing
DOI
10.2139/ssrn.5494548
Source
SMU Cox: Marketing (Topic)
Language
English
