A Comparative Evaluation of Three Large Language Models for Parent-Centered Questions About Anorexia Nervosa

Yeşilkaya, Celal; Keleş, Hande; Taşpolat, Esra; Mutlu, Caner; Turan, Serkan

doi:10.1002/eat.70118

Yeşilkaya C., Keleş H. K., Taşpolat E. R., Mutlu C., Turan S.

International Journal of Eating Disorders, 2026 (SCI-Expanded, SSCI, Scopus)

Publication Type: Article / Full Article
Publication Date: 2026
DOI Number: 10.1002/eat.70118
Journal Name: International Journal of Eating Disorders
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Social Sciences Citation Index (SSCI), Scopus, BIOSIS, CINAHL, EMBASE, MEDLINE, PsycINFO, Social Sciences Abstracts
Keywords: anorexia nervosa, artificial intelligence, child and adolescent mental health, digital health, health information, large language models
Bursa Uludağ University Affiliated: Yes

Abstract

Background: Large language models (LLMs) are increasingly used to obtain health information, including guidance on child and adolescent mental health. In anorexia nervosa (AN), where early recognition and timely intervention are critical, the accuracy of AI-generated information available to parents may have important clinical implications. This study evaluated the performance of LLMs in responding to parent-oriented questions about AN.

Methods: A comparative model evaluation was conducted using three conversational AI systems: ChatGPT (GPT-4o), Google Gemini, and DeepSeek. Twenty questions representative of those frequently asked by parents of adolescents with AN were identified through online content exploration and expert review. Each question was submitted using standardized prompts in separate chat sessions. Responses were anonymized and independently evaluated by two board-certified child and adolescent psychiatrists across three dimensions: quality, usefulness, and reliability. Reproducibility was assessed through repeated queries in separate sessions.

Results: All three models demonstrated generally high levels of reliability and overall informational performance. ChatGPT achieved the highest overall accuracy (≈92%) and reproducibility (≈90%), followed by Gemini (≈88% accuracy; ≈85% reproducibility) and DeepSeek (≈86% accuracy; ≈83% reproducibility). Domain-level analysis showed lower accuracy across models for diagnosis and clinical assessment questions. Qualitative error analysis indicated that the omission of clinically relevant information was the most common limitation, while DeepSeek produced more factual inaccuracies and Gemini more generalized guidance.

Conclusions: LLMs may provide broadly accurate preliminary information for parents seeking guidance about AN. However, persistent omissions and domain-specific variability highlight important limitations. AI-generated information should therefore be regarded as a complementary resource rather than a replacement for professional clinical guidance.