AHNS Abstract: B005



Program Number: B005
Session Name: Poster Session

From Reddit to Review: Leveraging Web Scraping and LLMs to Draw Insights into Head and Neck Cancer

Roshan Dongre¹; Faizaan Khan¹; Heli Majeethia²; Koyal Ansingkar¹; Rahul Alapati, MD³; Omar Ahmed, MD²; Laura Kim, MD²; Nadia Mohyuddin, MD²; ¹Texas A&M School of Engineering Medicine; ²Houston Methodist Department of Otolaryngology; ³University of Kansas Department of Otolaryngology

INTRODUCTION: Patients with head and neck cancer experiencing complex symptomatology and treatment-related challenges may seek guidance and support in online communities as a supplement to medical care. The Head and Neck Cancer forum on Reddit, a social media platform, provides an extensive repository of crowd-sourced information where users exchange experiences and offer support. Manual analysis of this vast, unstructured data is impractical, prone to bias, and lacks replicability. This study explored using automated methods such as web scraping and natural language processing to systematically identify current concerns regarding head and neck cancer.

METHODS: A Python-based web scraper was developed to collect posts from the Head and Neck Cancer forum, covering the period from its inception on August 12, 2018, to October 30, 2024. The extracted posts were then processed through the OpenAI API for sentiment analysis and thematic categorization, focusing on symptoms, topics, and patient concerns. Specifically, each post was queried to identify its topic, the type of cancer being discussed (if applicable), and the reason for dissatisfaction if negative sentiment was identified. Topics were categorized as medication, surgery, diagnosis-related questions, research, and other, while dissatisfaction reasons were categorized as pain, other side effects, lack of support, mental health concerns, and other. Categories were not mutually exclusive, so a post could receive multiple labels. Outputs were then analyzed using chi-square tests in Python 3.9 to quantify relationships between sentiment and cancer type.
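
The multi-label categorization step described above can be illustrated with a minimal sketch. It assumes the language model is asked to reply with a JSON object; the key names (sentiment, topics, cancer_type, dissatisfaction), the sentiment values, and the function name parse_labels are all hypothetical and are not taken from the study. Only the category lists come from the methods text. The sketch covers validating a reply, not the API call itself.

```python
import json

# Category lists follow the methods text; everything else here is an assumption.
TOPICS = {"medication", "surgery", "diagnosis", "research", "other"}
DISSATISFACTION = {
    "pain",
    "other side effects",
    "lack of support",
    "mental health",
    "other",
}
# Assumed sentiment scale; the abstract only mentions negative sentiment.
SENTIMENTS = {"positive", "neutral", "negative"}


def parse_labels(raw_json):
    """Validate one hypothetical JSON reply and return cleaned labels."""
    data = json.loads(raw_json)

    sentiment = str(data.get("sentiment", "")).lower()
    if sentiment not in SENTIMENTS:
        raise ValueError("unexpected sentiment: %r" % sentiment)

    # Labels are not mutually exclusive, so keep every recognized topic.
    topics = sorted({str(t).lower() for t in data.get("topics", [])} & TOPICS)
    if not topics:
        topics = ["other"]

    # Dissatisfaction reasons are only recorded for negative posts.
    reasons = []
    if sentiment == "negative":
        found = {str(r).lower() for r in data.get("dissatisfaction", [])}
        reasons = sorted(found & DISSATISFACTION)

    return {
        "sentiment": sentiment,
        "topics": topics,
        "cancer_type": data.get("cancer_type"),  # None when not applicable
        "dissatisfaction": reasons,
    }
```

Dropping labels outside the predefined lists keeps the downstream counts restricted to the study's categories, at the cost of silently discarding anything unexpected in a reply.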

RESULTS: We identified 724 posts from 704 unique authors, with 8,839 associated comments. Of these posts, 57.46% (n=416) were related to medications, 29.14% (n=211) were from users seeking support, 25.41% (n=185) were questions about surgery, 10.08% (n=73) were questions about diagnosis, and 9.81% (n=71) were about research. Regarding cancer classifications, 35.91% (n=260) were about squamous cell carcinoma, with notable subsets including tonsil cancer at 14.23% (n=103), nasopharyngeal cancer at 2.07% (n=15), and laryngeal cancer at 1.38% (n=10). In addition, 0.83% (n=6) of the posts discussed experiences with adenoid cystic carcinoma. Regarding dissatisfaction, 21.13% (n=153) of posts involved pain, 18.37% (n=133) involved side effects from medication and surgery, 8.43% (n=61) mentioned a lack of support, and 4.01% (n=29) involved mental health issues. The chi-square test revealed a statistically significant association between cancer type and the likelihood of expressing negative sentiment (χ² = 23.64, df = 5, p = 0.0003). Among the types of cancer, posts related to adenoid cystic carcinoma and tonsil cancer showed the highest frequency of negative sentiment (67% and 57% of posts, respectively). In contrast, only 30% of posts concerning laryngeal cancer exhibited negative sentiment.
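
As a rough illustration of the chi-square test of independence reported above, the following sketch uses only the Python standard library. It is not the authors' analysis code: the function names are hypothetical, and the 2x2 table at the end contains made-up counts for demonstration only, since the abstract does not report the full contingency table. The one value taken from the abstract is the reported statistic (χ² = 23.64, df = 5), which is used to recompute the p-value.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Closed-form starting values for the regularized upper gamma function.
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y)
    else:
        a, q = 1.0, math.exp(-y)             # Q(1, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi_square_independence(table):
    """Pearson chi-square test on a list-of-rows contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_sf(stat, df)


# Recompute the p-value from the statistic and df given in the abstract.
reported_p = chi2_sf(23.64, 5)
print(round(reported_p, 4))  # rounds to 0.0003 at four decimals

# Hypothetical counts (NOT study data): rows = cancer type,
# columns = negative / non-negative posts.
demo_table = [[10, 20], [20, 10]]
stat, df, p = chi_square_independence(demo_table)
print(stat, df, p)
```

With six cancer categories and a binary sentiment outcome, the table would have (6 - 1) x (2 - 1) = 5 degrees of freedom, which matches the df reported in the abstract.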

CONCLUSION: This analysis of online discussions on head and neck cancer reveals that medication-related questions are most prevalent, with substantial interest also focused on support, surgery, and diagnosis. Negative sentiment was notably higher for certain cancer types, and dissatisfied posts often cited pain, side effects, and lack of support. The statistically significant association between cancer type and negative sentiment suggests a need for tailored patient care to address these specific concerns.

 

 
