TY - JOUR
T1 - Focused digital cohort selection from social media using the metric backbone of biomedical knowledge graphs
AU - Guo, Ziqi
AU - Felag, Jack
AU - Rozum, Jordan C.
AU - Correia, Rion Brattig
AU - Wang, Xuan
AU - Rocha, Luis M.
N1 - Publisher Copyright:
© 2025
PY - 2025/8
Y1 - 2025/8
N2 - Social media data allows researchers to construct large digital cohorts — groups of users who post health-related content — to study the interplay between human behavior and medical treatment. Identifying the users most relevant to a specific health problem is, however, a challenge in that social media sites vary in the generality of their discourse. While X (formerly Twitter), Instagram, and Facebook cater to wide ranging topics, Reddit subgroups and dedicated patient advocacy forums trade in much more specific, biomedically-relevant discourse. To filter relevant users on any social media, we have developed a general method and tested it on epilepsy discourse. We analyzed the text from posts by users who mention epilepsy drugs at least once in the general-purpose social media sites X and Instagram, the epilepsy-focused Reddit subgroup (r/Epilepsy), and the Epilepsy Foundation of America (EFA) forums. We used a curated medical terminology dictionary to generate a knowledge graph (KG) from each social media site, whereby nodes represent terms, and edge weights denote the strength of association between pairs of terms in the collected text. Our method is based on computing the metric backbone of each KG, which yields the (sparsified) subgraph of edges that participate in shortest paths. By comparing the subset of users who contribute to the backbone to the subset who do not, we show that epilepsy-focused social media users contribute to the KG backbone in much higher proportion than do general-purpose social media users. Furthermore, using human annotation of Instagram posts, we demonstrate that users who do not contribute to the backbone are much more likely to use dictionary terms in a manner inconsistent with their biomedical meaning and are rightly excluded from the cohort of interest. Our metric backbone approach, thus, has several benefits: it yields focused user cohorts who engage in discourse relevant to a targeted biomedical problem; unlike engagement-based approaches, it can retain low-engagement users who nonetheless contribute meaningful biomedical insights and filter out very vocal users who contribute no relevant content, it is parameter-free, algebraically principled, does not require classifiers or human-curation, and is simple to compute with the open-source code we provide.
AB - Social media data allows researchers to construct large digital cohorts — groups of users who post health-related content — to study the interplay between human behavior and medical treatment. Identifying the users most relevant to a specific health problem is, however, a challenge in that social media sites vary in the generality of their discourse. While X (formerly Twitter), Instagram, and Facebook cater to wide ranging topics, Reddit subgroups and dedicated patient advocacy forums trade in much more specific, biomedically-relevant discourse. To filter relevant users on any social media, we have developed a general method and tested it on epilepsy discourse. We analyzed the text from posts by users who mention epilepsy drugs at least once in the general-purpose social media sites X and Instagram, the epilepsy-focused Reddit subgroup (r/Epilepsy), and the Epilepsy Foundation of America (EFA) forums. We used a curated medical terminology dictionary to generate a knowledge graph (KG) from each social media site, whereby nodes represent terms, and edge weights denote the strength of association between pairs of terms in the collected text. Our method is based on computing the metric backbone of each KG, which yields the (sparsified) subgraph of edges that participate in shortest paths. By comparing the subset of users who contribute to the backbone to the subset who do not, we show that epilepsy-focused social media users contribute to the KG backbone in much higher proportion than do general-purpose social media users. Furthermore, using human annotation of Instagram posts, we demonstrate that users who do not contribute to the backbone are much more likely to use dictionary terms in a manner inconsistent with their biomedical meaning and are rightly excluded from the cohort of interest. Our metric backbone approach, thus, has several benefits: it yields focused user cohorts who engage in discourse relevant to a targeted biomedical problem; unlike engagement-based approaches, it can retain low-engagement users who nonetheless contribute meaningful biomedical insights and filter out very vocal users who contribute no relevant content, it is parameter-free, algebraically principled, does not require classifiers or human-curation, and is simple to compute with the open-source code we provide.
KW - Epilepsy
KW - Network science
KW - Network sparsification
KW - Patient cohort selection
KW - Social media mining
UR - http://www.scopus.com/inward/record.url?scp=105007431556&partnerID=8YFLogxK
U2 - 10.1016/j.jbi.2025.104847
DO - 10.1016/j.jbi.2025.104847
M3 - Article
C2 - 40460925
AN - SCOPUS:105007431556
SN - 1532-0464
VL - 168
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
M1 - 104847
ER -