AHNS Abstract: B019



Program Number: B019
Session Name: Poster Session

AI in Action: Assessing an evidence-based artificial intelligence model in relation to American Thyroid Association guidelines

Delaney S Clark, BS1; Arati Bendapudi, BS1; Nishat Momin, MD2; Sepehr Shabani, MD2; Orly M Coblens, MD2; Viran Ranasinghe, MD2; 1John Sealy School of Medicine, University of Texas Medical Branch; 2Department of Otolaryngology, University of Texas Medical Branch

Introduction: The use of conversational large language models (LLMs) as artificial intelligence (AI) tools by the general public has grown rapidly in recent years, with ChatGPT being the most widely recognized. These models can aid educational endeavors, from improving scientific writing to supporting healthcare research and personalizing learning based on prior input. OpenEvidence (OE) is an AI system that focuses on aggregating and synthesizing clinically relevant evidence into a more easily accessible and readable format. Because it draws information only from journals and other peer-reviewed sources, one would expect its outputs to differ from those of other currently available AI models.

Methods: Guidelines from the American Thyroid Association (ATA) on differentiated thyroid cancer (DTC) and anaplastic thyroid cancer were analyzed and converted into questions. These questions were then entered into OpenEvidence, and the outputs were aggregated and grouped by the ATA's level of recommendation (no recommendation, weak, or strong) and quality of evidence (low, moderate, or high). Two independent reviewers, both attending physicians, graded each output on a 4-point Likert scale for accuracy and completeness. The grades were then analyzed for inter-rater agreement and for differences in accuracy between the DTC and anaplastic outputs.
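The inter-rater agreement step can be illustrated with a minimal sketch of Cohen's kappa. The abstract does not state whether an unweighted or weighted kappa was used, so the unweighted form is assumed here. The ratings below are hypothetical and are not the study data, and the function names are illustrative only. The verbal labels follow the commonly used Landis and Koch bands, which is also an assumption about the convention applied.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical scores.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal counts, summed over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_observed - p_expected) / (1 - p_expected)


def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch verbal bands (assumed convention)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical 4-point Likert grades from two reviewers for eight outputs.
rater_a = [4, 4, 3, 4, 2, 4, 3, 4]
rater_b = [4, 3, 3, 4, 2, 4, 4, 4]

kappa = cohens_kappa(rater_a, rater_b)  # 9/17, about 0.53
print(f"kappa = {kappa:.2f} ({landis_koch_label(kappa)})")
```

Because Likert grades are ordinal, a weighted kappa that gives partial credit for near-misses is a common alternative. The unweighted form above treats a 3 versus 4 disagreement the same as a 1 versus 4 disagreement.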

Results: The DTC outputs scored an average of 3.69/4 for completeness and 3.31/4 for accuracy. Cohen’s kappa for inter-rater agreement was 0.02 for completeness and 0.34 for accuracy, indicating slight to fair agreement. The anaplastic thyroid cancer outputs scored an average of 3.69/4 for completeness and 3.88/4 for accuracy, with Cohen’s kappa of 0.53 and 0.72, respectively, indicating moderate to substantial agreement. When t-tests were used to compare mean completeness and accuracy ratings between the DTC and anaplastic groups, the difference in completeness scores was not significant, but the difference in accuracy scores was significant (p < 0.05). There was no significant difference in scores between the ATA level-of-evidence groups.
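The comparison of mean ratings between the two guideline groups can be sketched as a two-sample t-test. The abstract does not say which t-test variant was used, so Welch's unequal-variance form is assumed here. The scores below are hypothetical and are not the study data, and `welch_t` is an illustrative helper name.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_x, sample_y):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    nx, ny = len(sample_x), len(sample_y)
    if nx < 2 or ny < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error of each sample mean (sample variance / n).
    se2_x = variance(sample_x) / nx
    se2_y = variance(sample_y) / ny
    if se2_x + se2_y == 0:
        raise ValueError("both samples have zero variance")
    t_stat = (mean(sample_x) - mean(sample_y)) / sqrt(se2_x + se2_y)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = (se2_x + se2_y) ** 2 / (se2_x ** 2 / (nx - 1) + se2_y ** 2 / (ny - 1))
    return t_stat, dof


# Hypothetical accuracy grades (4-point Likert) for each guideline group.
dtc_accuracy = [3, 3, 4, 3, 4, 3]
anaplastic_accuracy = [4, 4, 4, 3, 4, 4]

t_stat, dof = welch_t(dtc_accuracy, anaplastic_accuracy)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Converting the statistic to a p-value requires the Student's t distribution. If SciPy is available, `scipy.stats.ttest_ind(x, y, equal_var=False)` computes the same Welch test and returns the p-value directly.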

Conclusion: OpenEvidence is a novel evidence-based AI model that is available to the general public and generates outputs from peer-reviewed journal articles, a capability most LLMs currently lack. In our analysis, OE generated complete and accurate answers to questions about the ATA guidelines for DTC and anaplastic thyroid cancer. The anaplastic thyroid cancer guidelines were last updated in 2021 and the DTC guidelines in 2015; OE's answers were significantly more accurate for the more recently updated guidelines, which may indicate that it produces more accurate outputs on topics with recent updates to the literature. Our study demonstrates the capability of advances in AI to interpret and understand surgical guidelines, and although limitations in the accuracy of the information produced remain, OE has potential as an educational tool for both surgeons and the general public.

 

 
