JMIR Formative Research. 2025 Jun 11;9:e63272. doi: 10.2196/63272

Improving Suicidal Ideation Detection in Social Media Posts: Topic Modeling and Synthetic Data Augmentation Approach

Hamideh Ghanadian¹, Isar Nejadgholi², Hussein Al Osman¹

Author affiliations

¹ School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON, Canada.

² National Research Council Canada, Ottawa, ON, Canada.

DOI: 10.2196/63272 PMID: 40499163

Abstract

Background: In an era dominated by social media conversations, it is pivotal to comprehend how suicide, a critical public health issue, is discussed online. Discussions around suicide often highlight a range of topics, such as mental health challenges, relationship conflicts, and financial distress. However, certain sensitive issues, like those affecting marginalized communities, may be underrepresented in these discussions. This underrepresentation is a critical issue to investigate because it is mainly associated with underserved demographics (eg, racial and sexual minorities), and models trained on such data will underperform on such topics.

Objective: The objective of this study was to bridge the gap between established psychology literature on suicidal ideation and social media data by analyzing the topics discussed online. Additionally, by generating synthetic data, we aimed to ensure that datasets used for training classifiers have high coverage of critical risk factors to address and adequately represent underrepresented or misrepresented topics. This approach enhances both the quality and diversity of the data used for detecting suicidal ideation in social media conversations.

Methods: We first performed unsupervised topic modeling to analyze suicide-related data from social media and identify the most frequently discussed topics within the dataset. Next, we conducted a scoping review of established psychology literature to identify core risk factors associated with suicide. Using these identified risk factors, we then performed guided topic modeling on the social media dataset to evaluate the presence and coverage of these factors. After identifying topic biases and gaps in the dataset, we explored the use of generative large language models to create topic-diverse synthetic data for augmentation. Finally, the synthetic dataset was evaluated for readability, complexity, topic diversity, and utility in training machine learning classifiers compared to real-world datasets.
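The coverage check described in the Methods can be illustrated with a small sketch. It is not the authors' pipeline: the abstract does not name the topic modeling method, so plain keyword matching stands in for guided topic modeling here. The seed keywords, the example posts, the 0.1 threshold, and the function names are all hypothetical, and the "synthetic" posts are hand-written placeholders rather than language model output.

```python
import math
import re
from collections import Counter

# Hypothetical seed keywords per risk factor (illustrative, not from the paper).
SEED_TOPICS = {
    "financial_distress": {"debt", "rent", "job", "unemployed"},
    "relationship_conflict": {"breakup", "divorce", "partner"},
    "racism": {"racism", "racist", "discrimination"},
}


def tokenize(text):
    """Lowercase a post and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def topic_coverage(posts, seed_topics):
    """Fraction of posts mentioning at least one seed keyword of each topic."""
    counts = Counter()
    for post in posts:
        tokens = tokenize(post)
        for topic, seeds in seed_topics.items():
            if tokens & seeds:
                counts[topic] += 1
    return {topic: counts[topic] / len(posts) for topic in seed_topics}


def underrepresented(coverage, threshold):
    """Topics whose share of posts falls below the threshold."""
    return sorted(t for t, share in coverage.items() if share < threshold)


def normalized_entropy(coverage):
    """Shannon entropy of the topic shares, scaled to the range [0, 1]."""
    total = sum(coverage.values())
    if total == 0 or len(coverage) < 2:
        return 0.0
    probs = [v / total for v in coverage.values() if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(coverage))


# Toy "real" posts: two topics are present, one is missing entirely.
posts = [
    "I lost my job and cannot pay rent",
    "The breakup left me feeling empty",
    "Drowning in debt after being unemployed for a year",
    "My partner and I keep fighting",
]
coverage = topic_coverage(posts, SEED_TOPICS)
gaps = underrepresented(coverage, threshold=0.1)

# Placeholder "synthetic" posts written to fill the detected gap.
synthetic = [
    "Facing racism at work every day wears me down",
    "The discrimination never stops",
]
augmented_coverage = topic_coverage(posts + synthetic, SEED_TOPICS)

print(gaps)
print(normalized_entropy(coverage), normalized_entropy(augmented_coverage))
```

Normalized entropy is used here only as one simple proxy for topic diversity; the abstract does not state which diversity measure the study applied.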

Results: Our study found that several critical suicide-related topics, particularly those concerning marginalized communities and racism, were significantly underrepresented in the real-world social media data. The introduction of synthetic data, generated using GPT-3.5 Turbo, and the augmented dataset improved topic diversity. The synthetic dataset showed levels of readability and complexity comparable to those of real data. Furthermore, the incorporation of the augmented dataset in fine-tuning classifiers enhanced their ability to detect suicidal ideation, with the F₁-score improving from 0.87 to 0.91 on the University of Maryland Reddit Suicidality Dataset test subset and from 0.70 to 0.90 on the synthetic test subset, demonstrating its utility in improving model accuracy for suicidal narrative detection.
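For readers unfamiliar with the metric reported above, the F₁-score is the harmonic mean of precision P and recall R, that is, F₁ = 2PR / (P + R). The sketch below computes it for a binary classifier on made-up labels. The labels and the function name are illustrative assumptions, and the abstract does not say whether the reported scores are binary, macro-averaged, or weighted.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = post flagged as suicidal ideation, 0 = not flagged.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

score = f1_score(y_true, y_pred)
print(score)
```

With three true positives, one false positive, and one false negative, precision and recall are both 0.75, so the F₁-score is also 0.75.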

Conclusions: Our results demonstrate that synthetic datasets can be useful to obtain an enriched understanding of online suicide discussions as well as build more accurate machine learning models for suicidal narrative detection on social media.

Keywords: large language models; social media; suicidal ideation detection; synthetic data; topic modeling.

©Hamideh Ghanadian, Isar Nejadgholi, Hussein Al Osman. Originally published in JMIR Formative Research (https://formative.jmir.org), 11.06.2025.

Journal: JMIR Formative Research

ISSN: 2561-326X

Impact factor/quartile: 2.0/N/A

Related articles

MultiWD: Multi-label wellness dimensions in social media posts

Twitter-Based Sentiment Analysis and Topic Modeling of Social Media Posts Using Natural Language Processing, to Understand People's Perspectives Regarding COVID-19 Booster Vaccine Shots in India: Crucial to Expanding Vaccination Coverage

Clinical Age-Specific Seasonal Conjunctivitis Patterns and Their Online Detection in Twitter, Blog, Forum, and Comment Social Media Posts

Trends in Glucagon-Like Peptide-1 Receptor Agonist Social Media Posts Using Artificial Intelligence

Patients' Perspectives on Qualitative Olfactory Dysfunction: Thematic Analysis of Social Media Posts

Evaluating the predictability of medical conditions from social media posts

Analysis of Social Media Posts That Promote Women Surgeons

Social Media Posts by Recreational Marijuana Companies and Administrative Code Regulations in Washington State

An Exploratory Content Analysis of the Use of Health Communication Strategies and Presence of Objectification in Fitness Influencer Social Media Posts
