AHNS Abstract: AHNS23



Program Number: AHNS23
Session Name: Scientific Session 5 - Technology & Education
Session Date: Thursday, May 15, 2025
Session Time: 10:15 AM - 11:00 AM

Benchmarking the Competency, Credibility and Consistency of ChatGPT-4o

Yash V Shroff, BA1; Rohith R Kariveda, BA1; Jessica R Levi, MD2; 1Boston University Chobanian & Avedisian School of Medicine; 2Department of Otolaryngology - Head and Neck Surgery, Boston Medical Center

Background: Interest in artificial intelligence (AI) large language models (LLMs), most notably ChatGPT (Generative Pre-trained Transformer), has grown rapidly within otolaryngology. Given its ability to provide detailed, context-specific responses, several otolaryngology studies have explored its potential for enhancing clinical training and patient education. However, a gap remains in our understanding of ChatGPT’s limitations and credibility, which is critical for guiding future applications of AI and LLMs in head and neck cancer care. Hence, the objective of this study was to establish a benchmarked understanding of ChatGPT’s competency and limitations within head and neck surgery, including its ability to answer different question types, its ability to cite traceable and relevant literature, and its intrarater reliability.

Methods: This study assessed OpenAI’s latest free LLM, ChatGPT-4o, using the Head and Neck questions from the American Academy of Otolaryngology - Head and Neck Surgery (AAO-HNS) OTOQuest Knowledge Assessment question bank. These questions were categorized as first- or second-order, clinical vignette or non-vignette, and management or diagnosis style questions. Each question was independently provided to ChatGPT-4o using a standardized context and prompt, and the model’s answer, explanation with cited sources, and self-reported confidence were recorded. This process was conducted five times, and the responses were analyzed to determine which question types ChatGPT was most adept at answering, whether it was capable of providing legitimate and relevant literature in its explanations, and its intrarater reliability, ultimately benchmarking the model’s capabilities and limitations.
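
For illustration, a repeated-query protocol of this kind could be scripted as in the minimal sketch below. It assumes the OpenAI Python SDK and the model name "gpt-4o", and the prompt template is hypothetical; the abstract indicates questions were entered into ChatGPT-4o directly and does not specify the exact context or prompt wording.

```python
# Minimal sketch of a repeated-query benchmarking loop.
# Assumptions: OpenAI Python SDK >= 1.0, model name "gpt-4o", and a
# hypothetical prompt template (the study's exact wording is not given).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are answering a head and neck surgery board-style question.\n"
    "Question: {question}\n"
    "Answer choices: {choices}\n"
    "Give the single best answer, an explanation with cited sources, "
    "and your confidence from 1 to 5."
)

def query_model(question: str, choices: str, n_runs: int = 5) -> list[str]:
    """Submit the same question n_runs times and collect the raw responses."""
    responses = []
    for _ in range(n_runs):
        completion = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": PROMPT_TEMPLATE.format(question=question, choices=choices),
            }],
        )
        responses.append(completion.choices[0].message.content)
    return responses
```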

Results: On average, ChatGPT-4o answered 76.06% of all 61 head and neck questions correctly, with a mean self-reported confidence of 4.64 out of 5. The model never gave a confidence of less than 4, even when answering a question incorrectly. It correctly answered 78.23% of first-order questions, 74.90% of clinical vignette questions, and 82.5% of management questions. ChatGPT-4o provided relevant references for 80.87% of questions on average. It performed significantly better on clinical management questions than on diagnosis questions (t=5.48, p=0.000589) and better on first-order questions than on second-order questions (t=2.40, p=0.0413). There was no statistically significant difference between its performance on vignette and non-vignette questions (t=1.20, p=0.296). The percent agreement in answer choices across the five iterations of all 61 head and neck questions was 83.6%, demonstrating high intrarater reliability (Fleiss’ kappa: κ = 0.8009, 95% CI: 0.7546-0.8472).
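
The consistency statistics could be reproduced from the recorded answer choices roughly as sketched below. This assumes answers are coded as integer category labels in a 61 x 5 array (one column per iteration), uses the Fleiss' kappa implementation in statsmodels, and takes percent agreement to mean average pairwise agreement across iterations; the study's exact agreement definition is not stated in the abstract.

```python
# Sketch of the consistency analysis.
# Assumptions: answers coded as integer labels in a 61 x 5 array (questions x
# iterations); percent agreement = mean pairwise agreement across iterations.
from itertools import combinations

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
answers = rng.integers(0, 4, size=(61, 5))  # placeholder data, not study results

# Fleiss' kappa treats the five iterations as five "raters" per question.
table, _ = aggregate_raters(answers)
kappa = fleiss_kappa(table, method="fleiss")

# Mean pairwise percent agreement across the five iterations.
pairs = list(combinations(range(answers.shape[1]), 2))
agreement = np.mean([(answers[:, i] == answers[:, j]).mean() for i, j in pairs])

print(f"Fleiss' kappa: {kappa:.4f}, percent agreement: {agreement:.1%}")
```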

Conclusion: Overall, ChatGPT-4o demonstrated a greater ability to accurately answer head and neck questions and to provide relevant, evidence-based references than other AI LLMs in previous studies. It also demonstrated high intrarater reliability, a novel finding indicative of strong internal consistency. However, the model never reported a confidence of less than 4, even when answering questions incorrectly, suggesting that it is overconfident. Given the necessity for AI LLMs to recognize their limitations prior to clinical use, ChatGPT’s imperfect accuracy, and its diminished ability to answer second-order questions, the model requires improvement prior to clinical application. Further areas for investigation include the use of prompt engineering, follow-up questions, and refinement of the model’s self-reported confidence measure.

 
