Uludağ Üniversitesi Tıp Fakültesi Dergisi, vol.52, pp.7-12, 2026 (TRDizin)
This study aims to evaluate the longitudinal development of general-purpose and specialized artificial intelligence tools in terms of reliability in academic writing and citation accuracy. Eight platforms (ChatGPT, Gemini, QuillBot, Claude, Microsoft Copilot, Elicit, Consensus, and SciSpace) were analyzed using five standardized medical prompts in November 2024 and January 2026. The generated introductions were assessed for reference authenticity using PubMed, Google Scholar, and Web of Science, and for plagiarism using iThenticate. Findings revealed that in November 2024, general-purpose Large Language Models exhibited high hallucination rates, with ChatGPT and Claude providing zero authentic references for certain prompts. Conversely, specialized academic tools such as Elicit and SciSpace maintained near-perfect accuracy from the outset. By January 2026, a dramatic improvement was observed, with general-purpose tools like ChatGPT achieving 100% reference accuracy across all categories. Although plagiarism rates were typically below 15%, Gemini recorded a peak of 45% in 2024 before stabilizing. Specialized tools demonstrated a superior capacity to handle larger citation volumes; SciSpace, for instance, provided 31 verified references in a single output in 2026. Although both general-purpose and specialized tools have matured significantly, researchers should still exercise caution and apply verification protocols. The results indicate that artificial intelligence tools have rapidly transitioned from being prone to academic hallucinations to becoming highly reliable instruments for scholarly literature synthesis.