
JMIR Formative Research. 2025 Jun 11;9:e63272. doi: 10.2196/63272

Improving Suicidal Ideation Detection in Social Media Posts: Topic Modeling and Synthetic Data Augmentation Approach


Hamideh Ghanadian 1, Isar Nejadgholi 2, Hussein Al Osman 1

Author affiliations

  • 1 School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON, Canada.
  • 2 National Research Council Canada, Ottawa, ON, Canada.

DOI: 10.2196/63272; PMID: 40499163

    Abstract

    Background: In an era dominated by social media conversations, it is pivotal to comprehend how suicide, a critical public health issue, is discussed online. Discussions around suicide often highlight a range of topics, such as mental health challenges, relationship conflicts, and financial distress. However, certain sensitive issues, like those affecting marginalized communities, may be underrepresented in these discussions. This underrepresentation is a critical issue to investigate because it is mainly associated with underserved demographics (eg, racial and sexual minorities), and models trained on such data will underperform on such topics.

    Objective: The objective of this study was to bridge the gap between established psychology literature on suicidal ideation and social media data by analyzing the topics discussed online. Additionally, by generating synthetic data, we aimed to ensure that datasets used for training classifiers have high coverage of critical risk factors to address and adequately represent underrepresented or misrepresented topics. This approach enhances both the quality and diversity of the data used for detecting suicidal ideation in social media conversations.

    Methods: We first performed unsupervised topic modeling to analyze suicide-related data from social media and identify the most frequently discussed topics within the dataset. Next, we conducted a scoping review of established psychology literature to identify core risk factors associated with suicide. Using these identified risk factors, we then performed guided topic modeling on the social media dataset to evaluate the presence and coverage of these factors. After identifying topic biases and gaps in the dataset, we explored the use of generative large language models to create topic-diverse synthetic data for augmentation. Finally, the synthetic dataset was evaluated for readability, complexity, topic diversity, and utility in training machine learning classifiers compared to real-world datasets.
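The coverage check described in the Methods can be illustrated with a much simpler stand-in. The sketch below is not the authors' guided topic modeling; it only counts how many posts mention at least one seed word for each risk factor and flags topics that fall under a threshold. The seed lists, the `threshold` value, the function names, and the toy posts are all hypothetical and invented for illustration.

```python
import re
from collections import Counter

# Hypothetical seed words per risk factor (illustrative, not taken from the study).
SEED_TOPICS = {
    "financial_distress": {"debt", "rent", "unemployed", "bills"},
    "relationship_conflict": {"breakup", "divorce", "argument"},
    "racism": {"racism", "racist", "discrimination"},
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def topic_coverage(posts, seed_topics):
    """Return, per topic, the fraction of posts containing at least one seed word."""
    counts = Counter()
    for post in posts:
        tokens = tokenize(post)
        for topic, seeds in seed_topics.items():
            if tokens & seeds:
                counts[topic] += 1
    total = len(posts)
    return {t: (counts[t] / total if total else 0.0) for t in seed_topics}


def underrepresented(coverage, threshold=0.05):
    """List topics whose coverage falls below the (assumed) threshold."""
    return sorted(t for t, share in coverage.items() if share < threshold)


# Invented toy posts, used only to exercise the functions.
posts = [
    "I lost my job and the bills keep piling up",
    "After the breakup I can't sleep",
    "Rent is due and I am still unemployed",
    "Another argument with my partner tonight",
]

coverage = topic_coverage(posts, SEED_TOPICS)
gaps = underrepresented(coverage)
print(coverage)
print(gaps)
```

In a setup like this, the topics returned by `underrepresented` would be the candidates for synthetic augmentation. Keyword matching is far cruder than embedding-based topic models, so it serves only to make the idea of a coverage gap concrete.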

    Results: Our study found that several critical suicide-related topics, particularly those concerning marginalized communities and racism, were significantly underrepresented in the real-world social media data. The introduction of synthetic data, generated using GPT-3.5 Turbo, and the augmented dataset improved topic diversity. The synthetic dataset showed levels of readability and complexity comparable to those of real data. Furthermore, the incorporation of the augmented dataset in fine-tuning classifiers enhanced their ability to detect suicidal ideation, with the F1-score improving from 0.87 to 0.91 on the University of Maryland Reddit Suicidality Dataset test subset and from 0.70 to 0.90 on the synthetic test subset, demonstrating its utility in improving model accuracy for suicidal narrative detection.
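For readers unfamiliar with the metric reported above, the F1-score is the harmonic mean of precision and recall. The sketch below computes it for a single positive class; the abstract does not state which averaging scheme the authors used, so the binary form here is an assumption, and the labels are invented toy values rather than results from the study.

```python
def f1_score(y_true, y_pred, positive=1):
    """Compute the F1-score for one positive class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented toy labels: 1 = post flagged as suicidal ideation, 0 = not flagged.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]

score = f1_score(y_true, y_pred)
print(round(score, 2))
```

Because F1 ignores true negatives, it is a common choice when the positive class is the one that matters most, as in screening tasks.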

    Conclusions: Our results demonstrate that synthetic datasets can be useful to obtain an enriched understanding of online suicide discussions as well as build more accurate machine learning models for suicidal narrative detection on social media.

    Keywords: large language models; social media; suicidal ideation detection; synthetic data; topic modeling.



    Copyright © JMIR Formative Research.

    Journal information

    Journal: JMIR Formative Research

    ISSN: 2561-326X

    Impact factor: 2.0 (quartile: N/A)
