This paper investigates whether Large Language Models (LLMs), guided by prompt engineering, can automate the complex Discourse Quality Index (DQI) for measuring deliberative quality. An evaluation of state-of-the-art models from OpenAI and Google across varying numbers of In-Context Learning (ICL) examples (0–50) shows that LLMs achieve high fidelity, comparable to human annotators on standard reliability metrics. Performance improves markedly with only a few examples and plateaus at around 25–50 examples. While both models perform well, their differences highlight the interplay between model selection and ICL strategy. Error analysis identifies specific DQI dimensions requiring further improvement, suggesting future work on advanced reasoning prompts. This study confirms the viability of LLMs for scaling DQI measurement and provides practical guidance on optimizing ICL strategies. It also contributes a modular, adaptable AI engineering pipeline that researchers can reuse for their own prompting experiments across a variety of measurement tasks.
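As a rough illustration of the kind of few-shot setup the abstract describes, the sketch below assembles a k-shot ICL prompt for DQI-style annotation and sends it to an OpenAI chat model. This is a minimal sketch, not the authors' pipeline: the instruction text, example format, DQI dimensions, and the `build_messages`/`annotate` helpers are illustrative assumptions.

```python
# Minimal sketch (not the paper's actual pipeline): build a k-shot ICL prompt
# for DQI-style annotation and query an OpenAI chat model.
# The instruction text, example schema, and dimension names are assumptions.
from openai import OpenAI

INSTRUCTIONS = (
    "You are an annotator applying the Discourse Quality Index (DQI). "
    "For the speech below, return a JSON object with integer codes for "
    "each DQI dimension (e.g., justification, respect, constructive politics)."
)

def build_messages(examples: list[dict], speech: str, k: int) -> list[dict]:
    """Build a chat prompt using the first k labeled examples as ICL demonstrations."""
    messages = [{"role": "system", "content": INSTRUCTIONS}]
    for ex in examples[:k]:  # vary k (e.g., 0, 5, 25, 50) to compare ICL strategies
        messages.append({"role": "user", "content": ex["speech"]})
        messages.append({"role": "assistant", "content": ex["labels_json"]})
    messages.append({"role": "user", "content": speech})
    return messages

def annotate(client: OpenAI, model: str, examples: list[dict], speech: str, k: int) -> str:
    """Request a DQI annotation for one speech; returns the model's raw JSON string."""
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(examples, speech, k),
        temperature=0,  # reduce sampling variance for reliability comparisons
    )
    return response.choices[0].message.content

# Example usage (hypothetical data and model name):
# labels = annotate(OpenAI(), "gpt-4o", labeled_examples, new_speech, k=25)
```

Holding the prompt template fixed and sweeping k is what allows the resulting annotations to be compared against human codings with standard reliability metrics.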
Journal of Political Institutions and Political Economy, Volume 6, Issue 3-4 Special Issue: Artificial Intelligence and the Study of Political Institutions