Performance of several large language models when answering common patient questions about type 1 diabetes in children: accuracy, comprehensibility and practicality



DENKBOY ÖNGEN Y., AYDIN A. İ., ATAK M., EREN E.

BMC Pediatrics, vol. 25, no. 1, pp. 799, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 25 Issue: 1
  • Publication Date: 2025
  • DOI: 10.1186/s12887-025-05945-6
  • Journal Name: BMC Pediatrics
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, EMBASE, Food Science & Technology Abstracts, MEDLINE, Veterinary Science Database, Directory of Open Access Journals
  • Page Numbers: pp. 799
  • Keywords: Artificial intelligence, ChatGPT, Google Gemini, Large language models, Patient information, Type 1 diabetes
  • Bursa Uludağ University Affiliated: Yes

Abstract

BACKGROUND: The use of large language models (LLMs) in healthcare has expanded significantly with advances in natural language processing. Models such as ChatGPT and Google Gemini are increasingly used to generate human-like responses to questions, including those posed by patients and their families. With the rising incidence of type 1 diabetes (T1D) among children, families frequently seek reliable answers about the disease. Previous research has focused on type 2 diabetes, but studies on T1D in a pediatric population remain limited. This study aimed to evaluate and compare the performance and effectiveness of different LLMs in answering common questions about T1D.

METHODS: This cross-sectional, comparative study used questions frequently asked by children with T1D and their parents. Twenty questions were selected from inquiries made to pediatric endocrinologists via social media. The performance of five LLMs (ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Gemini, and Gemini Advanced) was assessed using a standard prompt for each model. The responses were evaluated by five pediatric endocrinologists with an interest in diabetes using the General Quality Scale (GQS), a 5-point Likert scale assessing factors such as accuracy, language simplicity, and empathy.

RESULTS: All five LLMs responded to the 20 selected questions, and their performance was evaluated by GQS scores. ChatGPT-4o had the highest mean score (3.78 ± 1.09), while Gemini had the lowest (3.40 ± 1.24). Despite these differences, no significant variation was observed among the models (p = 0.103). However, ChatGPT-4o, ChatGPT-4, and Gemini Advanced produced higher-quality answers than ChatGPT-3.5 and Gemini, scoring consistently between 3 and 4 points. ChatGPT-3.5 showed the smallest variation in response quality, indicating consistency without reaching the higher performance levels of the other models.

CONCLUSIONS: This study demonstrated that all evaluated LLMs performed similarly in answering common questions about T1D. LLMs such as ChatGPT-4o and Gemini Advanced can provide above-average, accurate, and patient-friendly answers to common questions about T1D. Although no significant differences were observed, the latest versions of LLMs show promise for integration into healthcare, provided they continue to be evaluated and improved. Further research should focus on developing specialized LLMs tailored for pediatric diabetes care.
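The abstract does not name the statistical test behind p = 0.103. As a rough, hypothetical illustration of the kind of analysis it describes (five raters scoring 20 responses per model on a 5-point GQS, then a comparison across the five models), the sketch below uses synthetic placeholder scores and assumes a Kruskal-Wallis test, a common nonparametric choice for ordinal Likert data; the authors may have used a different method, and none of these numbers are the study's data.

```python
# Minimal sketch of a GQS comparison across five LLMs.
# Scores are synthetic placeholders, NOT the study's data, and the
# Kruskal-Wallis test is an assumption (the abstract names no test).
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Hypothetical 5-point Likert GQS ratings: 20 questions x 5 raters per model.
models = ["ChatGPT-3.5", "ChatGPT-4", "ChatGPT-4o", "Gemini", "Gemini Advanced"]
scores = {m: rng.integers(1, 6, size=20 * 5) for m in models}

# Mean +/- SD per model, mirroring how the abstract reports results.
for m, s in scores.items():
    print(f"{m}: mean GQS = {s.mean():.2f} +/- {s.std(ddof=1):.2f}")

# Nonparametric comparison across the five models (ordinal data, >2 groups).
stat, p = kruskal(*scores.values())
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p:.3f}")
```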