For this study, we applied the popular Latent Dirichlet Allocation (LDA) topic model,
commonly used in communication studies, partly based on code developed for the study of
same-sex marriage and marijuana legalization discourse on Reddit in Babak Hemmatian,
Sabina J. Sloman, Uriel Cohen Priva & Steven A. Sloman, Think of the Consequences: A
Decade of Discourse about Same-sex Marriage, 51 BEHAV. RSCH. METHODS, March
11, 2019; and Babak Hemmatian, Taking the High Road: A Big Data Investigation of
Natural Discourse in the Emerging U.S. Consensus about Marijuana Legalization (Ph.D.
thesis, Brown University, February 12, 2022). The original exposition of LDA can be found
in: David M. Blei, Andrew Y. Ng & Michael I. Jordan, Latent Dirichlet Allocation, 3 J.
MACH. LEARNING RSCH. 993, 993–1022 (2003). Other examples of the method’s use can be
found in Ilana Heintz, Ryan Gabbard, Mahesh Srivastava, Dave Barner, Donald Black,
Marjorie Freedman & Ralph Weischedel, Automatic Extraction of Linguistic Metaphors
with LDA Topic Modeling, PROC. FIRST WORKSHOP ON METAPHOR IN NLP 58 (2013);
Daniel Maier et al., Applying LDA Topic Modeling
in Communication Research: Toward a Valid and Reliable Methodology, 12 COMMC’N
METHODS & MEASURES 93 (2018); Hamed Jelodar, Yongli Wang, Chi Yuan, Xia Feng,
Xiahui Jiang, Yanchao Li & Liang Zhao, Latent Dirichlet Allocation (LDA) and Topic
Modeling: Models, Applications, a Survey, 78 MULTIMEDIA TOOLS & APPLICATIONS 15169
(2019). We chose the LDA approach because past research has shown it can reveal
semantic content of natural language beyond the level of words, allowing for the
differentiation of multiple meanings of a single term. Paul DiMaggio, Manish Nag & David
Blei, Exploiting Affinities Between Topic Modeling and the Sociological Perspective on
Culture: Application to Newspaper Coverage of U.S. Government Arts Funding, 41
POETICS 570 (2013) (describing LDA as basically “a statistical model of language”). This model is also
appealing for its ability to identify changes over time in the topics occurring in a large
corpus of natural language data. Both properties are empirically demonstrated in the
published work from which our code base is derived.
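To make the modeling step concrete, the sketch below shows how an LDA model of the kind described above can be fit in Python. It uses the gensim library with a toy corpus and illustrative parameter values; these are our own assumptions for exposition, not the settings or implementation used in the study.

```python
# A minimal LDA sketch using gensim; the toy documents, topic count, and
# training parameters are illustrative assumptions, not the study settings.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

documents = [
    ["court", "ruling", "marriage", "equality"],
    ["legalization", "marijuana", "policy", "state"],
    ["marriage", "state", "court", "policy"],
]

dictionary = Dictionary(documents)                        # token -> integer id
corpus = [dictionary.doc2bow(doc) for doc in documents]   # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

# Each topic is a distribution over the vocabulary, and each document is
# a distribution over topics.
for topic_id, terms in lda.print_topics(num_words=4):
    print(topic_id, terms)
print(lda.get_document_topics(corpus[0]))
```

Because each document receives its own distribution over topics, the same term can load onto different topics in different contexts, which is what permits the disambiguation of multiple word senses noted above.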
To improve the quality of our topic model, we applied common preprocessing
techniques to the dataset. We lowercased all words in our corpus, so that different
capitalizations of the same word would not be treated as distinct terms, and reduced
different grammatical forms of the same word to a uniform lemma (a process called
lemmatization). HTML escape codes, uninformative stop words, URLs, newline
characters, punctuation, ubiquitous terms (words that appeared in 99% of the documents),
rare terms (those appearing in only a single document), and non-alphanumeric characters
were removed from the dataset. We used the lemmatizer from the spaCy Python package
and the set of stop words from the Natural Language Toolkit (NLTK). Steven Bird, Ewan
Klein & Edward Loper, NATURAL LANGUAGE PROCESSING WITH PYTHON (2009). Our corpus
contained 22,692 articles (27,797,084 words in total,
comprising 23,438 unique words) with a mean document length of 1,224 words (median =
536, SD = 2,168.7).
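The sketch below illustrates the preprocessing steps described in this paragraph, using spaCy for lemmatization and the NLTK stop-word list. The function name, the toy input, and the specific filtering calls are assumptions made for illustration; the study’s code base implements these steps in its own way.

```python
# Illustrative preprocessing: lowercasing, cleanup, lemmatization (spaCy),
# stop-word removal (NLTK), and frequency-based term filtering (gensim).
import html
import re

import spacy
from gensim.corpora import Dictionary
from nltk.corpus import stopwords      # requires nltk.download("stopwords")

# Assumes the small English spaCy model (en_core_web_sm) is installed.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
stop_words = set(stopwords.words("english"))

def preprocess(text):
    text = html.unescape(text)                   # resolve HTML escape codes
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = text.replace("\n", " ").lower()       # newlines; lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # punctuation, non-alphanumerics
    return [tok.lemma_ for tok in nlp(text)
            if tok.lemma_ not in stop_words and not tok.is_space]

raw_articles = [                                 # toy stand-in for the corpus
    "The court &amp; the state: a ruling\nhttps://example.com",
    "Marijuana legalization policy in the state.",
]
tokenized = [preprocess(article) for article in raw_articles]

# Drop terms found in only one document or in more than 99% of documents.
dictionary = Dictionary(tokenized)
dictionary.filter_extremes(no_below=2, no_above=0.99)
```

The document-frequency cutoffs (no_below=2, no_above=0.99) correspond to the removal of single-document and ubiquitous terms described above.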