Research Article

Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can!

Nathan TeBlunthuis, Valerie Hase & Chung-Hong Chan
Published online: 16 Jan 2024
 

Abstract

Automated classifiers (ACs), often built via supervised machine learning (SML), can categorize large, statistically powerful samples of data ranging from text to images and video. They have become widely popular measurement devices in communication science and related fields. Despite this popularity, even highly accurate classifiers make errors that cause misclassification bias and misleading results when classifier output is fed into downstream statistical analyses, unless those analyses account for these errors. As we show in a systematic literature review of SML applications, communication scholars largely ignore misclassification bias. In principle, existing statistical methods can use “gold standard” validation data, such as that created by human annotators, to correct misclassification bias. We introduce and test such methods, including a new method that we design and implement in the R package misclassification_models, using Monte Carlo simulations designed to reveal each method’s limitations; we release these simulations alongside the package. Based on our results, we recommend our new error correction method because it is versatile and efficient. In sum, automated classifiers, even those below common accuracy standards or those making systematic misclassifications, can be useful for measurement with careful study design and appropriate error correction methods.
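To make the abstract’s central claim concrete, here is a toy illustration of misclassification bias in R (our example, not the paper’s released simulation code): a binary independent variable measured by a 90%-accurate classifier attenuates the estimated regression coefficient.

    # Toy illustration of misclassification bias (illustrative only).
    set.seed(42)
    n <- 10000
    x <- rbinom(n, 1, 0.5)                         # true binary variable
    w <- ifelse(rbinom(n, 1, 0.9) == 1, x, 1 - x)  # classifier output: 90% accurate
    y <- 1 + 2 * x + rnorm(n)                      # DV depends on the truth
    coef(lm(y ~ x))["x"]  # ~2.0: estimate using the true variable
    coef(lm(y ~ w))["w"]  # ~1.6: attenuated estimate using the classifier

Even at 90% accuracy, the naive estimate is biased toward zero by roughly 20%, which is the kind of downstream distortion that error correction methods address.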

Author note

We have uploaded a preprint of this article (TeBlunthuis et al., 2023). An online appendix with additional results and discussion complements the main text and is available in an Open Science Framework (OSF) repository: https://osf.io/pyqf8/files/osfstorage/656f977c783ec62762eddabf.

Acknowledgement

Dr. TeBlunthuis led the project conception, wrote the initial draft, and developed the simulations, real-data example, and error correction methods. Dr. Hase designed and led the literature review. Dr. Chan contributed to the project conception and the literature review, led the R package development, and advised throughout. All authors contributed to writing.

We thank the Computational Methods Hackathon pre-conference at the 2022 annual meeting of the International Communication Association (ICA). Dr. TeBlunthuis presented a workshop on misclassification bias at this hackathon and connected with Dr. Chan, who was already working on this problem, and with Dr. Hase to begin our collaboration. We also thank our colleagues who provided feedback on this manuscript at various stages, including Benjamin Mako Hill, Nick Vincent, Aaron Shaw, Ruijia Cheng, Kaylea Champion, and Jeremy Foote with the Community Data Science Collective, as well as Abigail Jacobs, Manoel Horta Ribeiro, Nicolai Berk, and Marko Bachl. We also thank the anonymous reviewers with the computational methods division at ICA 2023, whose comments helped improve this manuscript, and the division’s awards committee for the recognition. We similarly thank the editorial team and anonymous reviewers at Communication Methods & Measures for their invaluable comments. Any remaining errors are our own. Additional thanks to those who shared data with Dr. TeBlunthuis during the project’s conception: Jin Woo Kim, Sandra Gonzalez-Bailon, and Manlio De Domenico. This work was facilitated through the use of advanced computational, storage, and networking infrastructure provided by the Hyak supercomputer system at the University of Washington.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/19312458.2023.2293713.

Notes

1 The appendix is available in an Open Science Framework (OSF) repository: https://osf.io/pyqf8/files/osfstorage/656f977c783ec62762eddabf.

2 The measurement error literature defines categories of measurement error, of which we will discuss four: (1) Measurement error in an IV is classical when the measurement equals the true value plus noise. In other words, it is nonsystematic and the variance of an AC’s predictions is greater than the variance of the true value (Carroll et al., 2006). (2) Berkson measurement error in an IV is when the truth equals the measurement plus noise. In agreement with prior work, we assume that nonsystematic measurement error from ACs is classical because an AC is an imperfect model of reality (Fong & Tyler, 2021; Zhang, 2021). Similarly, it is hard to imagine how an AC would have Berkson errors, as predictions would then have lower variance than the training data. We thus do not consider Berkson error. (3) Measurement error in an IV is called differential if it is not conditionally independent of the DV given the other IVs. (4) Measurement error in a DV is called systematic when it is correlated with an IV. We use this more general term to simplify our discussions that pertain equally to misclassified IVs and DVs.
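In symbols (our notation, added here for orientation): let X be the true value, W the AC’s measurement, U the error, Y the DV, and Z the other IVs. Then:

    \text{Classical (IV):}\quad W = X + U,\ \operatorname{Var}(W) > \operatorname{Var}(X)
    \text{Berkson (IV):}\quad X = W + U,\ \operatorname{Var}(W) < \operatorname{Var}(X)
    \text{Differential (IV):}\quad U \not\perp Y \mid (X, Z)
    \text{Systematic (DV):}\quad \operatorname{Cov}(U, X) \neq 0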

3 Automated content analysis includes a range of methods both for assigning content to predefined categories (e.g., dictionaries) and for assigning content to unknown categories (e.g., topic modeling) (Grimmer & Stewart, 2013). While we focus on SML, our arguments extend to other approaches such as dictionary-based classification and even beyond the specific context of text classification.

4 Metric variables were created in 35% of studies, mostly via the non-parametric method by Hopkins and King (2010).

5 Statisticians have studied other methods including simulation extrapolation and score function methods. As we argue in the Appendix, these error correction methods are not advantageous when manually annotated data is available, as is often the case with ACs.

6 In particular see Chapter 8 (especially example 8.4) and Chapter 15 (especially 15.4.2).

7 It is worth noting that an analogous method can be implemented in a Bayesian framework instead of MLE. A Bayesian approach may offer additional flexibility and richer uncertainty quantification, but at the cost of added complexity.
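Schematically (our notation, not the paper’s), both the MLE and its Bayesian analog work with the same joint likelihood, which uses the observed true value x in the validation data V and marginalizes the latent true value in the primary data P:

    \mathcal{L}(\theta) = \prod_{i \in \mathcal{V}} f(y_i \mid x_i, z_i)\, g(w_i \mid x_i, y_i, z_i)\, h(x_i \mid z_i)\ \times\ \prod_{j \in \mathcal{P}} \sum_{x} f(y_j \mid x, z_j)\, g(w_j \mid x, y_j, z_j)\, h(x \mid z_j)

where f is the outcome model, g the classification-error model, and h the model of the true variable. A Bayesian version places priors on the parameters of f, g, and h and samples the posterior rather than maximizing.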

8 Fong and Tyler (2021) describe their method within an instrumental variable framework, but it is equivalent to regression calibration, the standard term in the measurement error literature.
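As a minimal sketch of regression calibration in R (our illustration; the names are hypothetical, not the authors’ package API): estimate E[X | W, Z] on the human-annotated validation subset, impute it for all observations, and use the imputed values in the main regression.

    # Regression-calibration sketch (illustrative only).
    set.seed(1)
    n <- 5000
    x <- rbinom(n, 1, 0.5)                          # true binary IV
    z <- rnorm(n)                                   # covariate
    w <- ifelse(rbinom(n, 1, 0.85) == 1, x, 1 - x)  # classifier output, 85% accurate
    y <- 1 + 2 * x + 0.5 * z + rnorm(n)             # outcome depends on the truth
    validated <- seq_len(n) <= 500                  # human-annotated validation subset
    d <- data.frame(x = ifelse(validated, x, NA), w, z, y, validated)

    calib <- lm(x ~ w + z, data = d, subset = validated)  # step 1: E[X | W, Z]
    d$x_hat <- predict(calib, newdata = d)                # step 2: impute for all rows
    coef(lm(y ~ x_hat + z, data = d))["x_hat"]            # step 3: close to the true 2

Note that naive standard errors from the final regression ignore the uncertainty of the first-stage calibration model; bootstrapping both steps together is the usual remedy.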

9 The code for reproducing our simulations and our experimental R package is available here: https://osf.io/pyqf8/?view_only=c80e7b76d94645bd9543f04c2a95a87e.

10 Bayesian networks are similar to causal directed acyclic graphs (DAGs), an increasingly popular representation in causal inference methodology. DAGs are Bayesian networks with directed edges that indicate the direction of cause and effect. We use Bayesian networks for generality and because causal direction is not important in the measurement error theory we use.

11 Classifier accuracy varies between our simulations because it is difficult to jointly specify classifier accuracy and the required correlations among variables, and because of random variation between simulation runs. We report the median accuracy over simulation runs.

12 Compare Equation D4 in the Appendix to Equations 24–28 from Zhang (2021).

Additional information

Notes on contributors

Nathan TeBlunthuis

Dr. Nathan TeBlunthuis (PhD University of Washington, 2021) is a Postdoctoral Research Fellow at the University of Michigan School of Information. His scholarship focuses on online communities, digitally mediated organization, collaborative knowledge work, and computational methods for communication research.

Valerie Hase

Dr. Valerie Hase is a postdoctoral researcher at the Department of Media and Communication at LMU Munich, Germany. Her research focuses on automated content analysis, digital trace data, bias in computational methods, and digital journalism.

Chung-Hong Chan

Dr. Chung-hong Chan (PhD University of Hong Kong, 2018) is Senior Researcher in the Department of Computational Social Science, GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany, and External Fellow at the Mannheim Center for European Social Research, University of Mannheim (Germany). An epidemiologist by training, he is interested in developing new quantitative methods for communication research.

