Abstract
Objectives
This study compared the readability of patient education materials from the Turkish Ophthalmological Association (TOA) retinopathy of prematurity (ROP) guidelines with those generated by large language models (LLMs). The ability of GPT-4.0, GPT-4o mini, and Gemini to produce patient education materials was evaluated in terms of accuracy and comprehensiveness.
Materials and Methods
Thirty questions from the TOA ROP guidelines were posed to GPT-4.0, GPT-4o mini, and Gemini. Their responses were then reformulated using the prompts “Can you revise the following text to make it understandable at a 6th-grade reading level?” (P1 format) and “Can you revise the following text to make it easier to understand?” (P2 format). The readability of the TOA ROP guidelines and the LLM-generated responses was analyzed using the Ateşman and Bezirci-Yılmaz formulas. Additionally, ROP specialists evaluated the comprehensiveness and accuracy of the responses.
Results
The TOA brochure was found to have a reading level above the 6th-grade level recommended in the literature. Materials generated by GPT-4.0 and Gemini had significantly greater readability than the TOA brochure (p<0.05). Adjustments made in the P1 and P2 formats improved readability for GPT-4.0, while no significant change was observed for GPT-4o mini and Gemini. GPT-4.0 had the highest scores for accuracy and comprehensiveness, while Gemini had the lowest.
Conclusion
GPT-4.0 appeared to have greater potential for generating more readable, accurate, and comprehensive patient education materials. However, when integrating LLMs into the healthcare field, regional medical differences and the accuracy of the provided information must be carefully assessed.
Introduction
Retinopathy of prematurity (ROP) is a vasoproliferative and multifactorial disease of the retina. It is primarily observed in preterm infants but can also occur in full-term infants who have received high levels of oxygen therapy.1 Advances in neonatal care have increased survival rates for preterm infants, which has resulted in more frequent encounters with conditions such as ROP. Annually, approximately 15 million babies worldwide are born prematurely (before 37 completed weeks of gestation).2 Each year, between 23,800 and 45,600 infants are reported to suffer from irreversible vision loss as a result of ROP.3 Particularly in low- and middle-income countries, up to 40% of childhood blindness is attributed to preventable ROP cases, and Türkiye is one of these countries.4 A multicenter study conducted in Türkiye revealed that among 6,115 preterm infants, 27% were diagnosed with some stage of ROP, and 6.7% developed severe ROP.5
ROP can be effectively managed with consistent monitoring and prompt therapy.6, 7 Monitoring commences soon after delivery and continues until retinal vascularization is fully established. The follow-up frequency is adjusted according to the severity of the disease: infants with severe ROP are examined weekly, while others are seen at longer intervals. However, delays in follow-up may lead to missed treatment opportunities and ultimately result in complete blindness.8 Providing families with comprehensive information about the disease and the treatment process is of utmost importance, as it greatly enhances their compliance with follow-up and treatment. Previous research has demonstrated that higher levels of knowledge within families are correlated with less anxiety and better adherence to treatment regimens.9, 10
In Türkiye, the Turkish Ophthalmological Association (TOA) offers patient education resources and informed consent forms for a range of disorders on its official website. Ensuring that these materials are comprehensible is crucial to facilitating patients’ information-gathering process.11 Per the guidelines of the American Medical Association and the National Institutes of Health, patient education materials should be written at a reading level equivalent to that of a 6th-grade student.12 Various formulas that analyze factors such as sentence length and word structure are frequently employed to evaluate readability.13 For Turkish texts, readability is commonly determined using the Ateşman14 and Bezirci and Yılmaz15 readability formulas.
Over the past few years, online information sources have emerged as readily available tools that patients strongly favor. A survey conducted by the Pew Research Center revealed that 61% of adults in the United States actively access health information through internet platforms.16 Nevertheless, it is widely recognized that online health information is generally written at a level requiring more education than many readers have.17, 18, 19 Large language models (LLMs) are artificial intelligence systems trained on content available on the internet to generate natural-language text.20 Models such as OpenAI’s ChatGPT and Google’s Gemini are being employed in the medical domain to provide patient education and create informative content.21, 22 Nevertheless, the reliability of these models remains a topic of contention, and further investigation is ongoing.23
This research examined the readability levels of ROP patient education materials, structured in a question-and-answer format, available on the TOA website using the Ateşman and Bezirci-Yılmaz formulas. Thirty questions from these materials were posed to the advanced language models GPT-4.0, GPT-4o mini, and Gemini, and the responses were used to generate patient brochures. The readability, accuracy, and comprehensiveness of these brochures were then evaluated to assess the models’ effectiveness in producing patient education materials.
Materials and Methods
The main data source for this study consisted of informational brochures created for families regarding the treatment guidelines for ROP, which can be obtained from the TOA website (https://www.todnet.org/tod-rehber/rop-tedavi-rehberi-2021.pdf, available in Turkish: Appendix 1: Informational Brochure for Families: Retinopathy of Prematurity Screening, Appendix 2: Informational Brochure for Families: Retinopathy of Prematurity Treatment).24 The guidelines comprise 30 questions pertaining to ROP, such as “What is ROP?” and “How is ROP treated?”, along with their accompanying responses. An independent analysis was conducted on each response from the guidelines using the Ateşman and Bezirci-Yılmaz readability formulas. Since our study used only publicly available data and literature and did not entail the use of any animal or human data, ethics committee approval and patient consent were not required.
Use of Large Language Models
In this study, 30 questions from the TOA ROP guidelines were posed to the ChatGPT-4.0, ChatGPT-4o mini, and Gemini models. Table 1 presents sample questions directed to the artificial intelligence tools used in this study. Each question was asked in a new chat session, and the responses were recorded. Additionally, the ability of LLMs to simplify texts for lower educational levels was evaluated. To assess this, the models were given their initial responses (initial format) with prompts to generate two new responses:25
Prompt 1: “Can you revise the following text to make it understandable at a 6th-grade reading level?” (P1 format).
Prompt 2: “Can you revise the following text to make it easier to understand?” (P2 format).
Each response was analyzed individually using the Ateşman and Bezirci-Yılmaz readability formulas.
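Although the questions were posed through the models’ chat interfaces, the three-step protocol (initial response, then P1 and P2 revisions) can be expressed programmatically. The sketch below uses the OpenAI Python client purely as an illustration; the model identifier is a placeholder and does not reflect the study’s actual setup:

```python
# Illustrative sketch of the three-step prompting protocol (initial, P1, P2).
# Assumes the OpenAI Python client; the model name is a placeholder and this
# is not the configuration used in the study, which used the chat interfaces.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

P1 = "Can you revise the following text to make it understandable at a 6th-grade reading level?"
P2 = "Can you revise the following text to make it easier to understand?"

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def collect_formats(question: str) -> dict:
    # Each question starts a fresh conversation, mirroring the
    # new-chat-session-per-question design of the study.
    initial = ask(question)
    return {
        "initial": initial,
        "P1": ask(f"{P1}\n\n{initial}"),
        "P2": ask(f"{P2}\n\n{initial}"),
    }
```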
Readability Criteria
Ateşman Readability Formula: The Ateşman formula provides a score between 0 and 100 based on average word length (in syllables) and average sentence length (in words); higher scores indicate easier texts. We conducted the Ateşman analysis using an online program. The scoring system is categorized as follows: 90-100 points correspond to a 4th-grade level or below, 80-89 points to a 5th- or 6th-grade level, 70-79 points to a 7th- or 8th-grade level, 60-69 points to a 9th- or 10th-grade level, 50-59 points to an 11th- or 12th-grade level, 40-49 points to an associate-degree level, 30-39 points to an undergraduate-degree level, and 29 points or below to a postgraduate-degree level.14
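The underlying formula, as given by Ateşman,14 is: Readability = 198.825 − 40.175 × (average syllables per word) − 2.610 × (average words per sentence). A minimal Python sketch follows; the vowel-counting syllable rule (each Turkish syllable contains exactly one vowel) and the naive sentence splitter are simplifying assumptions, and this is not the online tool used in the study:

```python
import re

# Turkish vowels (including dotted/dotless I variants); in Turkish, every
# syllable contains exactly one vowel, so counting vowels approximates
# the syllable count.
TURKISH_VOWELS = set("aeıioöuüAEIİOÖUÜ")

def atesman_score(text: str) -> float:
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    syllables = sum(1 for ch in text if ch in TURKISH_VOWELS)
    x1 = syllables / len(words)       # average syllables per word
    x2 = len(words) / len(sentences)  # average words per sentence
    return 198.825 - 40.175 * x1 - 2.610 * x2
```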
Bezirci-Yılmaz Readability Formula: The Bezirci-Yılmaz formula evaluates readability based on average sentence length and the frequency of words with three or more syllables; higher scores indicate harder texts. The Bezirci-Yılmaz analysis was conducted using a specialized software tool. The scoring system is as follows: 1-8 points correspond to the primary-school level, 9-12 points to the high-school level, and 12-16 points to the undergraduate level; scores above 16 indicate readability appropriate for academic-level texts.15
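The Bezirci-Yılmaz score is commonly given as √(OKS × (H3 × 0.84 + H4 × 1.5 + H5 × 3.5 + H6 × 26.25)), where OKS is the average number of words per sentence and H3-H6 are the average per-sentence counts of words with three, four, five, and six or more syllables. A minimal sketch under the same vowel-counting assumption as above (this is not the software tool used in the study):

```python
import math
import re

TURKISH_VOWELS = set("aeıioöuüAEIİOÖUÜ")

def syllables(word: str) -> int:
    # One vowel per syllable in Turkish.
    return sum(1 for ch in word if ch in TURKISH_VOWELS)

def bezirci_yilmaz_score(text: str) -> float:
    sentences = [s.split() for s in re.split(r"[.!?]+", text) if s.strip()]
    n = len(sentences)
    oks = sum(len(s) for s in sentences) / n  # average words per sentence
    # H3..H6: per-sentence averages of words with 3, 4, 5, and 6+ syllables.
    counts = {k: 0 for k in (3, 4, 5, 6)}
    for sentence in sentences:
        for word in sentence:
            syl = syllables(word)
            if syl >= 6:
                counts[6] += 1
            elif syl >= 3:
                counts[syl] += 1
    h3, h4, h5, h6 = (counts[k] / n for k in (3, 4, 5, 6))
    return math.sqrt(oks * (h3 * 0.84 + h4 * 1.5 + h5 * 3.5 + h6 * 26.25))
```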
Comprehensiveness and Accuracy of Patient-Targeted Information Produced by Large Language Models
The responses generated by LLMs were evaluated for comprehensiveness and accuracy based on the TOA ROP guidelines. Experts specialized in ROP and experienced in its clinical management (S.A.P. and A.D.) assessed the accuracy and comprehensiveness of the responses. The comprehensiveness of the answers was rated as follows:26
• 1 point: Insufficiently comprehensive (misses crucial information)
• 2 points: Somewhat comprehensive (contains minimal but necessary information)
• 3 points: Moderately comprehensive (provides a reasonable level of detail)
• 4 points: Comprehensive (includes critical information)
• 5 points: Very comprehensive (provides detailed and complete information)
The responses were evaluated for accuracy as follows:27
• 1 point: Poor (includes substantial inaccuracies and may be detrimental to patients)
• 2 points: Moderate (some inaccuracies but not likely to pose adverse effects for patients)
• 3 points: Excellent (free of errors)
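For analysis, the expert ratings can simply be recorded per question and averaged per model. A trivial illustration follows; the numbers shown are made-up placeholders, not the study’s data:

```python
from statistics import mean

# Hypothetical illustration of recording and aggregating expert ratings;
# the scores below are placeholders, not the study's data.
ACCURACY_LABELS = {1: "Poor", 2: "Moderate", 3: "Excellent"}

# Per-question accuracy ratings from the two raters, averaged per question.
accuracy_ratings = {
    "GPT-4.0":     [3.0, 2.5, 3.0],
    "GPT-4o mini": [2.5, 2.5, 2.0],
    "Gemini":      [2.0, 2.5, 2.0],
}

for model, scores in accuracy_ratings.items():
    print(f"{model}: mean accuracy = {mean(scores):.2f}")
```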
Statistical Analysis
One-way analysis of variance (ANOVA) was used to compare multiple means, followed by the post-hoc Tukey honestly significant difference (HSD) test to identify significant pairwise differences. Statistical analyses were conducted using SPSS software (IBM SPSS Statistics, version 26.0). A p value of <0.05 was considered statistically significant.
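As an illustration, the same ANOVA-plus-Tukey procedure can be reproduced with SciPy and statsmodels; the score arrays below are placeholders, not the study’s data, and the study itself used SPSS:

```python
# Illustrative reproduction of the reported analysis (one-way ANOVA followed
# by Tukey's HSD test). The numbers are placeholders, not the study's data.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

toa    = np.array([11.8, 12.4, 13.1, 12.0])  # placeholder readability scores
gpt4   = np.array([8.6, 9.2, 8.1, 9.0])
gemini = np.array([9.7, 10.1, 9.4, 10.3])

f_stat, p_value = f_oneway(toa, gpt4, gemini)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    scores = np.concatenate([toa, gpt4, gemini])
    groups = ["TOA"] * len(toa) + ["GPT-4.0"] * len(gpt4) + ["Gemini"] * len(gemini)
    print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```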
Results
Bezirci-Yılmaz Readability Scores
The Bezirci-Yılmaz readability analysis revealed that the texts initially produced by GPT-4.0 and Gemini had a significantly lower reading level than those in the TOA brochure (p=0.010 and p=0.039, respectively). No statistically significant difference was found between the materials generated by GPT-4o mini and the TOA brochure (p=0.325), nor in the comparisons among the other groups (Table 2).
When comparing the initial responses of the LLMs (GPT-4.0, Gemini, and GPT-4o mini) with their responses in the P1 and P2 formats, a statistically significant increase in readability was observed only in the responses of GPT-4.0 (p=0.005 and p=0.012, respectively). No significant differences were found in the other groups. Additionally, no statistically significant differences were observed between the responses in the P1 and P2 formats within any of the LLM groups (p>0.05) (Table 3).
Ateşman Readability Scores
When examining the Ateşman readability scores, the initial responses generated by GPT-4.0 and Gemini were found to have significantly lower reading levels compared to the TOA brochure (p=0.016 and p=0.006, respectively). No significant difference was found between GPT-4o mini and the TOA brochure (p=0.910). Additionally, GPT-4.0 and Gemini showed significantly lower reading levels compared to GPT-4o mini (p=0.042 and p=0.035, respectively). However, no significant difference was observed between GPT-4.0 and Gemini (Table 2).
None of the LLMs’ initial responses showed a statistically significant difference in Ateşman readability score when compared with their responses in the P1 and P2 formats, and no notable disparities were observed between the P1 and P2 formats for any of the models (Table 4). According to the Ateşman bands, the responses produced by GPT-4o mini corresponded to the 11th- to 12th-grade level, whereas those of the other LLM groups corresponded to the 9th- to 10th-grade level.
Comprehensiveness Scores
When comparing the comprehensiveness scores of the initial responses from the LLMs, the responses generated by GPT-4.0 were found to have a significantly higher level of comprehensiveness compared to those from GPT-4o mini and Gemini (p=0.045 and p=0.001, respectively). However, no significant difference in comprehensiveness was observed between GPT-4o mini and Gemini. The comprehensiveness scores of GPT-4.0’s responses in the P1 and P2 formats were higher than those of GPT-4o mini and Gemini (Table 5).
Accuracy Scores
When comparing the accuracy scores of the initial responses from the LLMs, GPT-4.0’s accuracy scores were found to be statistically significantly higher than those of Gemini (p=0.001). However, no significant difference in accuracy was observed between GPT-4o mini and either Gemini or GPT-4.0. When comparing the accuracy scores of responses in the P1 and P2 formats, GPT-4.0 was significantly more accurate than Gemini (p=0.039 and p=0.034, respectively). No other statistically significant differences were observed (Table 5).
Discussion
In this study, the readability of patient education materials in the TOA ROP treatment guidelines was assessed. According to the Bezirci-Yılmaz readability formula, the materials were at an average high-school level, whereas the Ateşman readability formula placed them at the 11th- or 12th-grade level. Research conducted in Türkiye revealed the average education level to be 6.51 years.28 When creating patient education materials, it is important to consider the average education level of each country.29 In the literature, the recommended reading level for patient education materials is often the 6th-grade level.12 Materials that exceed this level may be difficult for patient populations with limited health literacy to interpret, potentially reducing treatment adherence. The reading level of the TOA ROP guidelines is therefore higher than suggested for patient education materials, indicating that they should be simplified. A similar problem occurred with the materials produced by ChatGPT-4.0, ChatGPT-4o mini, and Gemini: their reading levels were also above the recommended level and did not align with the norms stated in the literature.30, 31
Delays in the treatment of ROP can lead to irreversible vision loss as well as significant medicolegal issues for healthcare professionals.32 The most common issue in malpractice cases related to ROP is the failure to perform timely screening or follow-up.33 One of the main reasons for this is that families do not have sufficient knowledge about ROP and the screening process. Studies in the literature have shown that when parents are informed and made aware, adherence to treatment improves and their infants have better outcomes.9, 10 In one study, it was reported that the parents of very low birth weight infants, especially those with limited English proficiency and poor health literacy, were not adequately informed about ROP, which negatively impacted treatment.34 The study showed that more than half of parents did not receive adequate information about their infant’s ROP condition upon discharge. One reason for this information gap is that 1 in 10 adults in the United States has low health literacy.2
An analysis conducted in the domain of pediatric ophthalmology revealed that online patient education materials were suitable for an audience with an average educational attainment of 11.75±2.72 years.34 Insufficient comprehensibility of such educational material may result in inadequate compliance with therapy among persons with limited health literacy. Hence, it is imperative to provide patient education materials that are easily understandable for individuals with lower education levels. According to the data collected in our study, the TOA guidelines for ROP are written at an unacceptably high reading level. Therefore, it is necessary to enhance the comprehensibility of these materials.
In this study, when the readability levels of the brochures generated by GPT-4.0, GPT-4o mini, and Gemini were compared with the TOA brochure, the GPT-4.0 and Gemini brochures were found to have lower reading levels (i.e., greater readability) than the TOA brochure. Additionally, in the P1 and P2 formats, which were designed to improve comprehensibility, an increase in readability (as assessed by Bezirci-Yılmaz score) was observed for the brochure created by GPT-4.0, while no significant changes were observed for Gemini or GPT-4o mini. These findings are consistent with the literature.27, 35, 36 In terms of readability, these findings indicate that GPT-4.0 may be a more appropriate choice for creating a Turkish ROP guide.
LLMs are emerging as new and intriguing tools in the healthcare sector. They show particular potential in patient consultation, medical triage, and the provision of information. LLMs can enhance access to healthcare by answering common medical questions from patients and improving care for individuals in remote or underserved areas.22, 37 Furthermore, these models can take on administrative tasks, allowing healthcare professionals to dedicate more time to patient care.38 However, the use of LLMs presents certain challenges. LLMs may provide inaccurate information, posing a risk to patients and their families, particularly in medical settings.39 These models have limited capacity to check their own responses and correct errors, and misleading or incomplete information could lead to medical errors, posing serious risks to patient safety.40 To fully integrate LLMs into clinical practice, further improvements in validation processes and stricter oversight of these models are essential.
Patient education materials must not only be easy to read but also thorough and accurate. In our study, we therefore also evaluated the accuracy and comprehensiveness of the LLM-generated brochures. The results showed that the GPT-4.0 materials were more complete than the GPT-4o mini and Gemini materials. In terms of accuracy, GPT-4.0 scored highest, while Gemini received the lowest scores. These data indicate that GPT-4.0 could be a more trustworthy model for creating patient education materials. Similarly, Pushpanathan et al.26 found that GPT-4.0 outperformed both GPT-3.5 and Google Bard in terms of accuracy and comprehensiveness when answering complex ocular symptom queries, highlighting its potential in patient education. Antaki et al.21 also reported that GPT-4.0 provided more consistent and relevant medical information compared to other LLMs in ophthalmology, underscoring its utility in generating reliable educational materials.
Another concern about the medical information offered by LLMs is the possibility of geographic variations in the data. Screening criteria for ROP may differ by country.2 While some criteria may not be met in developed nations, the risk of severe ROP is higher in less developed countries.39 The TOA ROP guidelines recommend screening all newborns delivered before 34 weeks of gestation or weighing less than 1,700 grams.5 GPT-4.0’s response to this question (“infants born before 30 weeks or weighing less than 1,500 grams”) was consistent with the screening criteria employed in the United Kingdom but not with the TOA standards for Türkiye.41 This disparity may generate uncertainty among patients’ relatives, potentially leading to misinformation and lower adherence to therapy.
Study Limitations
One of the major limitations of our study is the variability in the performance of language models across different languages. In our study, we asked questions in Turkish and requested that the responses be provided in Turkish as well. Additionally, we asked the language models to produce responses that were more understandable than those from Turkish sources. However, since LLMs are typically trained on English data, they may not perform as effectively in languages like Turkish. This discrepancy can be attributed to differences in linguistic structures and the limited availability of Turkish datasets.20 It has also been noted in the literature that LLMs tend to show reduced performance when generating medical information in less-represented languages, which can increase the risk of errors in clinical applications.42 Furthermore, the questions were posed as they appear in the TOA brochure, without the additional context of being asked from the perspective of a user in Türkiye. As such, the potential impact of including a phrase like “I am asking for Türkiye” on the model’s responses was not evaluated. Therefore, the use of these models in languages such as Turkish requires careful consideration and should be supported by validation processes conducted by local experts.
Conclusion
Educating patients and their families is critical in the management of ROP. The reading level of the TOA patient information pamphlets was determined to be higher than the acceptable level. In terms of readability, comprehensiveness, and accuracy, the GPT-4.0 brochures outperformed those of GPT-4o mini and Gemini. While LLMs are a promising tool in healthcare, some of the information they generate may be misleading, and there is a risk of misinformation owing to geographic variation in clinical practice. As a result, the integration of LLMs into healthcare should be thoroughly tested and supported by relevant recommendations. The accuracy of information generated by LLMs, particularly essential medical information, must be carefully assessed.