AHNS Abstract: B012
Program Number: B012
Session Name: Poster Session

Assessing Generative Artificial Intelligence Use and Methodology Reporting in Head and Neck Surgery: A Scoping Review

Isaac L Alter, AB1; Karly Chan2; Katerina Andreadis, MS3; Anaïs Rameau, MD, MS, MPhil4; 1Columbia University Vagelos College of Physicians and Surgeons; 2Harvard College; 3Department of Population Health Sciences, New York University Grossman School of Medicine; 4Sean Parker Institute for the Voice, Department of Otolaryngology-Head and Neck Surgery, Weill Cornell Medical College

Introduction: As interest in large language models (LLMs) has swept across healthcare fields, clinicians and researchers in otolaryngology-head and neck surgery (OHNS) have sought to explore their potential. However, the quality of such inquiries has varied widely, with literature often not including crucial information for reproducibility and replicability, such as prompting approach and generative artificial intelligence (GenAI) model parameters. This has substantial implications for interpretation of these studies, since LLMs are known to generate notably different output based on small differences in “prompt engineering.” This carries particular weight for the field of head and neck cancer (HNC) and head and neck surgery (HNS), in which GenAI has potential applications in determining clinical trial eligibility or providing decision support in treatment planning. Our objective was to (1) review recommendations in clinical informatics for transparent reporting of LLM-based studies, and (2) critically review methodological reporting and quality of LLM-focused literature in HNC and HNS.

Methods: A search strategy was devised with a medical librarian, adhering to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines. Databases were searched on October 19, 2024, including PubMed, Embase, Web of Science, ISCA Archive, and IEEE Xplore; gray literature was also searched via arXiv, medRxiv, and engRxiv. All primary studies using LLMs within OHNS were included. The publication time frame was limited to after November 2022, when OpenAI introduced ChatGPT, the first LLM made publicly accessible, marking the beginning of widespread research in LLMs.

Results: From a pool of 925 unique abstracts retrieved, 124 were included; of these, 27 focused primarily on HNS or HNC. All 27 studies used a version of ChatGPT; four used fine-tuned ChatGPT, four compared ChatGPT with other LLMs such as Google’s Gemini or Meta’s Llama, and two used multimodal ChatGPT to interpret images. Twelve studies (44%) focused on LLMs’ ability to provide accurate physician-facing treatment recommendations, 10 (37%) assessed LLMs’ answering of patient questions, three (11%) used LLMs to analyze HNS literature or provide references, and two (7%) leveraged LLMs as diagnostic aids using images or imaging reports. Twelve publications (44%) used individual patient data, while the others asked general questions. Only nine studies (33%) published all prompts used to query the LLM, while an additional nine (33%) published one or a subset of prompts, and 14 (52%) included a definitive count of the number of prompts used. Sixteen studies (59%) provided a justification of how their prompts were developed, only six (22%) reported testing or refining their prompts before querying the LLM, and none described how the number of prompts was decided. Six publications (22%) specified the number of times each prompt was run. Eight papers (30%) mentioned algorithmic bias or ethical concerns surrounding GenAI use.

Conclusions: LLM-focused literature in HNS, while exploring many potentially fruitful avenues, demonstrates overall poor adherence to current recommendations for methodological reporting. This severely limits the reproducibility and generalizability of published studies in this field, and suggests that best practices could be further disseminated and enforced by researchers and journals.
