AHNS Abstract: B013


Program Number: B013
Session Name: Poster Session

Comparative Performance of Artificial Intelligence Large Language Models' Knowledge of the 2015 American Thyroid Association Management Guidelines for Adult Patients with Thyroid Nodules and Differentiated Thyroid Cancer

Anthony M Saad, BA; Ariana L Shaari, BA; Celina Zhou, BSE; Rohini Bahethi, MD; Ghayoour S Mir, DO; Rutgers New Jersey Medical School

Objective: To determine the ability of artificial intelligence large language models (LLMs) to provide up-to-date responses to questions regarding management for adult patients with thyroid nodules and differentiated thyroid cancer (DTC).

Study Design: Comparative analysis.

Methods: The 2015 American Thyroid Association Management Guidelines for Adult Patients with Thyroid Nodules and Differentiated Thyroid Cancer were accessed. One hundred and one policies were extracted and broken down by subsection, yielding a total of 177 recommendations, each with its own strength of recommendation (none, weak, or strong) and level of evidence. The 177 recommendations were converted into question format. Each question was independently entered into two LLMs, ChatGPT 4o and Google Gemini. Two independent reviewers graded each LLM response for concordance with the guidelines and for resource credibility. Both concordance and resource credibility were graded on a binary scale. Univariate analyses were used to identify statistical associations.

Results: A total of 177 questions, one per extracted recommendation, were given to both LLMs. Of the ChatGPT-generated responses, 81.36% were concordant with the 2015 ATA guidelines and 99.43% cited a credible resource. Of the Gemini-generated responses, 79.66% were concordant and 96.05% cited a credible resource. On logistic regression, responses from ChatGPT and Gemini had similar odds of being concordant (OR 1.11, 95% CI [0.66–1.89], p = 0.687). The odds of citing a credible resource also did not differ significantly between ChatGPT and Gemini responses (OR 7.25, 95% CI [0.88–59.53], p = 0.065). Greater recommendation strength was associated with greater concordance for ChatGPT (OR 6.53, 95% CI [2.34–18.27], p < 0.001), but no significant association with recommendation strength was found for Gemini (p > 0.05).
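As a rough arithmetic check on the between-model comparisons, the sketch below rebuilds response counts from the reported percentages and computes an unadjusted odds ratio with a Wald 95% confidence interval and two-sided p-value. The counts are an assumption (each percentage multiplied by 177 and rounded), and the Wald method is assumed rather than stated in the abstract; the function names are illustrative only. Under these assumptions the output appears to agree, to rounding, with the reported ORs of 1.11 and 7.25 and their intervals.

```python
import math

N = 177  # questions posed to each model (from the abstract)


def count_from_percent(pct, n=N):
    """Assumed reconstruction: convert a reported percentage back to a count."""
    return round(pct / 100 * n)


def odds_ratio_wald(a, n_a, b, n_b, z_crit=1.96):
    """Odds ratio of group A vs group B with a Wald 95% CI and two-sided p.

    a, b     : number of 'yes' outcomes in each group
    n_a, n_b : group sizes
    """
    a_no, b_no = n_a - a, n_b - b
    odds_ratio = (a / a_no) / (b / b_no)
    # Standard error of log(OR) from the four cell counts of the 2x2 table
    se = math.sqrt(1 / a + 1 / a_no + 1 / b + 1 / b_no)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z_crit * se)
    upper = math.exp(log_or + z_crit * se)
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(log_or / se) / math.sqrt(2))
    return odds_ratio, lower, upper, p_value


# Counts reconstructed from the reported percentages (assumption)
chatgpt_conc = count_from_percent(81.36)
gemini_conc = count_from_percent(79.66)
chatgpt_cred = count_from_percent(99.43)
gemini_cred = count_from_percent(96.05)

concordance = odds_ratio_wald(chatgpt_conc, N, gemini_conc, N)
credibility = odds_ratio_wald(chatgpt_cred, N, gemini_cred, N)

for label, (or_, lo, hi, p) in [("concordance", concordance),
                                ("credible resource", credibility)]:
    print(f"{label}: OR {or_:.2f}, 95% CI [{lo:.2f}-{hi:.2f}], p = {p:.3f}")
```

The wide interval for credible-resource use follows from the near-empty "not credible" cell for ChatGPT: under this reconstruction only one response falls there, which dominates the standard error.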

Conclusions: Artificial intelligence LLMs are capable of providing up-to-date information regarding guidelines on thyroid nodule and DTC management, particularly for stronger recommendations. ChatGPT and Gemini showed similar ability to provide guideline-concordant information using credible resources. While these technologies are promising as educational adjuncts, clinicians and trainees must be aware of their limitations.
