Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard

Please use this identifier to cite or link to this item: https://doi.org/10.1016/j.ebiom.2023.104770

DC Field	Value
dc.title	Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard
dc.contributor.author	Lim, ZW
dc.contributor.author	Pushpanathan, K
dc.contributor.author	Yew, SME
dc.contributor.author	Lai, Y
dc.contributor.author	Sun, CH
dc.contributor.author	Lam, JSH
dc.contributor.author	Chen, DZ
dc.contributor.author	Goh, JHL
dc.contributor.author	Tan, MCJ
dc.contributor.author	Sheng, B
dc.contributor.author	Cheng, CY
dc.contributor.author	Koh, VTC
dc.contributor.author	Tham, YC
dc.date.accessioned	2023-11-17T01:26:10Z
dc.date.available	2023-11-17T01:26:10Z
dc.date.issued	2023-09-01
dc.identifier.citation	Lim, ZW, Pushpanathan, K, Yew, SME, Lai, Y, Sun, CH, Lam, JSH, Chen, DZ, Goh, JHL, Tan, MCJ, Sheng, B, Cheng, CY, Koh, VTC, Tham, YC (2023-09-01). Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. eBioMedicine 95 : 104770-. ScholarBank@NUS Repository. https://doi.org/10.1016/j.ebiom.2023.104770
dc.identifier.issn	2352-3964
dc.identifier.uri	https://scholarbank.nus.edu.sg/handle/10635/246018
dc.description.abstract	Background: Large language models (LLMs) are garnering wide interest due to their human-like and contextually relevant responses. However, LLMs’ accuracy across specific medical domains has yet to be thoroughly evaluated. Myopia is a frequent topic on which patients and parents seek information online. Our study evaluated the performance of three LLMs, namely ChatGPT-3.5, ChatGPT-4.0, and Google Bard, in delivering accurate responses to common myopia-related queries. Methods: We curated thirty-one commonly asked myopia care-related questions, which were categorised into six domains: pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis. Each question was posed to the LLMs, and their responses were independently graded by three consultant-level paediatric ophthalmologists on a three-point accuracy scale (poor, borderline, good). A majority consensus approach was used to determine the final rating for each response. ‘Good’ rated responses were further evaluated for comprehensiveness on a five-point scale. Conversely, ‘poor’ rated responses were further prompted for self-correction and then re-evaluated for accuracy. Findings: ChatGPT-4.0 demonstrated superior accuracy, with 80.6% of responses rated as ‘good’, compared to 61.3% for ChatGPT-3.5 and 54.8% for Google Bard (Pearson's chi-squared test, all p ≤ 0.009). All three LLM-Chatbots showed high mean comprehensiveness scores (Google Bard: 4.35; ChatGPT-4.0: 4.23; ChatGPT-3.5: 4.11, out of a maximum score of 5). All LLM-Chatbots also demonstrated substantial self-correction capabilities: 66.7% (2 in 3) of ChatGPT-4.0's, 40% (2 in 5) of ChatGPT-3.5's, and 60% (3 in 5) of Google Bard's responses improved after self-correction. The LLM-Chatbots performed consistently across domains, except for ‘treatment and prevention’. However, ChatGPT-4.0 still performed best in this domain, receiving 70% ‘good’ ratings, compared to 40% for ChatGPT-3.5 and 45% for Google Bard (Pearson's chi-squared test, all p ≤ 0.001). Interpretation: Our findings underscore the potential of LLMs, particularly ChatGPT-4.0, for delivering accurate and comprehensive responses to myopia-related queries. Continuous strategies and evaluations to improve LLMs’ accuracy remain crucial. Funding: Dr Yih-Chung Tham was supported by the National Medical Research Council of Singapore (NMRC/MOH/HCSAINV21nov-0001).
dc.publisher	Elsevier BV
dc.source	Elements
dc.subject	ChatGPT-3.5
dc.subject	ChatGPT-4.0
dc.subject	Chatbot
dc.subject	Google Bard
dc.subject	Large language models
dc.subject	Myopia
dc.subject	Humans
dc.subject	Child
dc.subject	Benchmarking
dc.subject	Search Engine
dc.subject	Consensus
dc.subject	Language
dc.type	Article
dc.date.updated	2023-11-17T00:35:37Z
dc.contributor.department	DEAN'S OFFICE (DUKE-NUS MEDICAL SCHOOL)
dc.contributor.department	OPHTHALMOLOGY
dc.description.doi	10.1016/j.ebiom.2023.104770
dc.description.sourcetitle	eBioMedicine
dc.description.volume	95
dc.description.page	104770-
dc.published.state	Published
Appears in Collections:	Staff Publications Elements

Files in This Item:

File	Description	Size	Format	Access Settings	Version
Benchmarking large language models performances for myopia care a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Goog.pdf		648.63 kB	Adobe PDF	OPEN	Published	View/Download
