amount of data would be lost. The third option, which was adopted,
was to store all of the comments’ data and apply safe truncation,
which keeps only the first 512 tokens of the comment body.
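The truncation step described above can be sketched as follows. This is a minimal illustration and not the project's actual code: the function names and the whitespace tokenizer are assumptions made for the example, and a transformer tokenizer would count subword tokens rather than whitespace-separated words. Only the 512-token limit comes from the text.

```python
from typing import Callable, List

MAX_TOKENS = 512  # limit stated in the text


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): split on whitespace.

    A real model tokenizer would produce subword tokens instead.
    """
    return text.split()


def truncate_tokens(
    text: str,
    max_tokens: int = MAX_TOKENS,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> List[str]:
    """Return only the first `max_tokens` tokens of a comment body."""
    tokens = tokenize(text)
    return tokens[:max_tokens]


if __name__ == "__main__":
    # Hypothetical comment body with 1000 whitespace-separated tokens.
    long_comment = " ".join(f"word{i}" for i in range(1000))
    kept = truncate_tokens(long_comment)
    print(len(kept))  # 512
```

Comments shorter than the limit pass through unchanged, so the full comment can still be stored and truncation only affects what is passed to the model.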
X. CONCLUSION AND FUTURE WORK
This capstone project successfully established a comprehensive
data pipeline for sentiment and similarity analysis on
StackOverflow, giving insight into how users interact within
discussions of the selected programming languages. The NLP
models SBERT and EmoRoBERTa enabled the identification of
duplicate posts and the analysis of nuanced emotional
undertones in those conversations. The research highlights the
potential of NLP models to improve community management and
user experience on online platforms.
Future research can extend the existing models by integrating
real-time analysis tools for the immediate detection and removal
of both duplicated content and toxic interactions as they happen
on social media. Moreover, integrating a more extensive emotional
analysis that is specific to each programming language can give
better insight into the roots of toxic behavior. By continuing
to draw up-to-date data from the pipeline and incorporating
more dynamic models, this capstone project can broaden its
impact and foster a healthier and more engaged community
on StackOverflow and beyond.