Exploring Biases of Large Language Models in the Field of Mental Health: Comparative Questionnaire Study of the Effect of Gender and Sexual Orientation in Anorexia Nervosa and Bulimia Nervosa Case Vignettes


Introduction

Large Language Models in the Context of Mental Health

In recent years, there has been significant progress in the field of artificial intelligence (AI) []. In particular, the development of large language models (LLMs), such as OpenAI’s GPT models [], Google’s LaMDA [], or Meta’s Large Language Model Meta AI (LLaMA) [], has made the deployment of such algorithms accessible to researchers, clinicians, and the public alike []. With advancements in computational power and access to larger datasets, these models can now go beyond simple word counting [] and actually account for the relationships between words [,]. The technique of modeling words in a large context has been referred to as transformer-based large language modeling []. This may not only facilitate the automatic analysis of large amounts of text data [,] but, by modeling words in a large context, also allow the generation of meaningful text and the interactive use of this technology [,]. Thus, the application of LLMs may improve efficiency and effectiveness of data processing in various fields—including health care [].

Since psychology and psychotherapy research are primarily shaped by language, the potential of LLMs in this field is significant [,]. This becomes even more meaningful when considering the contribution of mental disorders to the global disease burden [] and acknowledging the persistent treatment gap in mental health care []. Especially in the field of psychological assessment, research on the use of LLMs is advanced []. For example, the use of transformer language models on language patterns has resulted in remarkably high predictive accuracy on standardized well-being rating scales []. This procedure of using LLMs to automatically generate psychological construct scores based on free text has been formally referred to as “language-based assessment” [,]. Findings indicate comparable levels of validity and reliability of language-based assessments compared with standardized rating scales [,]. Moreover, language-based assessments have the capacity to incorporate additional information beyond free text entries [], such as user age [].

LLMs have also been applied in the evaluation of clinical case vignettes, and ChatGPT-4 has been shown to assess suicidality as reliably as mental health professionals []. Furthermore, ChatGPT-3.5’s performance in diagnostic assessment and advice on disease management in a study using 100 clinical vignettes was rated as excellent by mental health professionals [].

Biases and Responsible AI

Despite the promising findings on the use of LLMs in the context of (mental) health, the issue of potential biases in information generated by LLMs has been raised. Because LLMs are increasingly being introduced into clinical practice, it is important to investigate potential biases to ensure a responsible use of AI [] and LLMs []. Since LLMs rely on training data that are directly or indirectly generated by humans, these models are likely to contain the same biases as the society in which they are created [-]. This is especially critical in (mental) health care [], where biases in LLMs may lead to discrimination against different social groups []. For example, ChatGPT 3.5 performed poorly in diagnosing an infectious disease known to be widely underdiagnosed []. Furthermore, ChatGPT 3.5 made different treatment recommendations based on insurance status, which might introduce health disparities []. When generating clinical cases, ChatGPT-4 failed to create cases that depicted demographic diversity and relied on stereotypes when choosing gender or ethnicity []. Thus, the need for “fair AI” has been pointed out, with the goal of developing prediction models that provide equivalent outputs for identical individuals who differ only in one sensitive attribute []. To avoid, or at least reduce, potential bias and move toward fair AI, this bias first needs to be conceptualized, measured, and understood []. The aim of this paper was to explore a potential bias in the evaluation of eating disorders (EDs), which have been subject to stigma [] and gender-biased assessment [].

EDs (Anorexia Nervosa or Bulimia Nervosa)

Anorexia nervosa (AN) and bulimia nervosa (BN) are severe EDs with many medical complications, high mortality rates [], slow treatment progress, and frequent relapses []. The lifetime prevalence of AN and BN is estimated to be 1%‐2% each []. Historically, AN and BN were described only in women, and it was not until the 21st century that research started to systematically investigate EDs in men []. Today, men are estimated to account for approximately 10%‐25% of AN or BN cases [,]. Research on gender differences in AN and BN is scarce and inconclusive, with no clear findings regarding genetic and environmental factors that might explain differences in the etiology or maintenance of these EDs []. Likewise, findings on severity and treatment outcomes are ambiguous. While one study suggests that men diagnosed with AN might have faster and more frequent remission [], another study found no difference []. Men might incur lower costs in outpatient treatment; however, this might be due to higher barriers to receiving treatment []. Men have been found to be more stigmatizing than women toward people with EDs [], and this internalized stigma might be one reason for their hesitancy to seek outpatient treatment.

In men, sexual orientation might increase the risk of developing an ED, with more men with an ED or ED-related behavior identifying as homosexual compared with the general population [,]. Furthermore, independent of being diagnosed with an ED, homosexual men report more psychological distress than heterosexual men, and in men with an ED, being homosexual was related to higher ED symptomatology []. In women, a review found no significant difference in overall disordered eating due to sexual orientation, but distinct symptom patterns, with homosexual women reporting less restrictive eating behavior and more binge eating [].

To conclude, men have been included in ED research only in the last 2 decades, and many questions remain open regarding the effect of gender on the prevalence, symptoms, and treatment outcomes of AN and BN. With regard to sexual orientation, there is evidence for an association between identifying as homosexual and a higher risk of EDs in men but not in women.

Objectives

We aimed to estimate the presence and size of bias related to gender and sexual orientation produced by ChatGPT-4, a common LLM, as well as MentaLLaMA, an LLM fine-tuned for the mental health domain, exemplified by their application in the context of ED symptomatology and health-related quality of life (HRQoL) of patients with AN or BN. By providing clinical case vignettes to the LLMs and instructing them to take up the role of a clinical psychologist rating the vignettes, we sought to mimic the diagnostic process of an LLM-based ED assessment.


Methods

Vignette Selection and Modification

We searched PubMed and Google Scholar up until October 2023 for vignettes in scientific papers published since 2000 that describe patients with either AN or BN. A total of 30 case vignettes were extracted from 12 different papers (published between 2001 and 2022). Of these vignettes, 22 described patients with AN and 8 described patients with BN. Most vignettes originally described a female patient (n=28). We then adapted gender and sexual orientation in each vignette to create 4 versions (2 × 2 design), describing a female versus male patient living with their female versus male partner (if either a marriage or age ≥30 years was mentioned, the term husband or wife was chosen, otherwise boyfriend or girlfriend). This resulted in 120 adapted vignettes. Some information was removed due to content policy violations, that is, drug abuse, self-mutilation, suicidal ideation or suicide attempts, sexual abuse, and traumatizing experiences. Furthermore, details on the menstrual cycle were removed, since they do not apply to male patients, as were indications of height, since these were unrealistically short for male patients. Finally, some specific details not needed in this context were removed, for example, study enrollment procedures and study-specific measures, medication plans, and the name of the hospital.
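The 2 × 2 adaptation just described can be sketched as follows. This is a minimal Python illustration of the partner-term rule only; in the study the editing was done on the published vignettes, and the function and placeholder names here are ours:

```python
# Sketch of the 2 x 2 vignette adaptation (female/male patient x female/male
# partner). Encodes the partner-term rule from the text: "husband"/"wife" if a
# marriage or age >=30 years is mentioned, otherwise "boyfriend"/"girlfriend".
# Names and the template are illustrative, not from the study.
from itertools import product

def partner_term(partner_gender: str, married_or_30plus: bool) -> str:
    if married_or_30plus:
        return "husband" if partner_gender == "male" else "wife"
    return "boyfriend" if partner_gender == "male" else "girlfriend"

def vignette_versions(template: str, married_or_30plus: bool) -> list:
    """Create the 4 versions of one vignette (2 patient genders x 2 partner genders)."""
    return [
        template.format(patient=patient_gender,
                        partner=partner_term(partner_gender, married_or_30plus))
        for patient_gender, partner_gender in product(["female", "male"], repeat=2)
    ]

tmpl = "A 21-year-old {patient} university student living with a {partner} ..."
print(len(vignette_versions(tmpl, married_or_30plus=False)))  # 4
```

Applied to all 30 source vignettes, this yields the 120 adapted vignettes.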

See Table 1 for further details about the vignettes.

Table 1. Vignettes included in the study, search term, and information on parts that were removed, added, or changed. Search term A: Google Scholar; October 13, 2023: (“case report” OR “case series”) AND (anorexia OR bulimia) AND “psychotherapy,” since 2000. Search term B: PubMed; August 11, 2023: eating disorder filter for “case report,” since 2000. Addition C: patient with AN (implied in title of paper), sex, sexual orientation, and living with boyfriend or girlfriend.

Vignette | Search term | Removed | Changed | Added
1 [] | A | GAF score | — | C
2 [] | A | GAF score, self-mutilation, suicide attempt | — | C
3 [] | A | GAF score, amenorrhea | — | C
4 [] | A | GAF score | — | C
5 [] | A | GAF score | “School” changed to “university” | C
6 [] | A | GAF score | “School” changed to “university” | C
7 [] | A | GAF score | — | C
8 [] | A | GAF score | “School” changed to “university” | C
9 [] | A | GAF score | “Living with parents” changed to “living with boyfriend or girlfriend” | Patient with AN (implied in title of paper), sex, and sexual orientation
10 [] | A | GAF score, amenorrhea | — | C
11 [] | A | GAF score, amenorrhea | — | C
12 [] | A | GAF score, amenorrhea, and suicide attempts | — | C
13 [] | A | GAF score, amenorrhea, and suicide attempts | — | C
14 [] | A | GAF score, amenorrhea, suicide ideation, and self-mutilation | — | C
15 [] | B | Menses, not sexually active | — | Living with boyfriend or girlfriend
16 [] | B | Medication details, menstrual cycle | — | Living with boyfriend or girlfriend
17 [] | B | Menstrual cycle | — | Living with husband or wife (>30 years)
18 [] | B | Sexual abuse, drugs or alcohol, suicide | — | Living with husband or wife
19 [] | B | — | — | Living with boyfriend or girlfriend
20 [] | B | Suicidal ideation | — | Living with husband or wife (>30 years)
21 [] | A | Substance abuse | — | Living with boyfriend or girlfriend
22 [] | A | Diagnostic manual and citation, name of measure, scientific consent, treated by author, and height (unrealistic if changed to male sex) | — | Living with boyfriend or girlfriend
23 [] | A | City, education, menstrual irregularities, and weight (unrealistic if changed to male sex) | — | Living with boyfriend or girlfriend
24 [] | A | PTSD, sexual abuse, mens, and study | — | Living with boyfriend or girlfriend
25 [] | A | Sexual abuse, PTSD, and mens or menopause | — | Living with husband or wife (>30 years)
26 [] | A | Enrollment in study | — | Living with boyfriend or girlfriend
27 [] | A | — | — | Living with husband or wife (>30 years)
28 [] | A | — | — | Living with boyfriend or girlfriend
29 [] | A | — | — | Living with husband or wife (>30 years)
30 [] | A | Height (unrealistic if changed to male sex) | “Single” changed to “living with boyfriend or girlfriend” | —

GAF: global assessment of functioning; AN: anorexia nervosa; PTSD: posttraumatic stress disorder; —: not applicable.

Ethical Considerations

We did not collect any data from human subjects within our study but instead conducted analyses on case vignettes that were previously published in a fully anonymized way in peer-reviewed, easily accessible journals. Therefore, no ethics application was required for this study.

Data Generation

In 3 rounds, each vignette was fed into ChatGPT-4 with the instruction to evaluate it by providing responses to 1 of the 2 psychometric instruments. This resulted in a total of 720 vignette evaluations (120 vignettes × 3 rounds × 2 measures). ChatGPT-4 was opened in an internet browser (Google Chrome) with the chat history turned off to avoid a learning effect from the repeated evaluation of case vignettes. In the “custom instructions” settings, the instruction “Set the temperature of your replies to 0” was included. This instruction is intended to minimize randomness in the text generation process and thereby maximize replicability. Data were generated between October and December 2023. See Textbox 1 for an example of a prompt. Data generation in MentaLLaMA had to be substantially adapted ().

Textbox 1. An example prompt for 1 of the 120 vignettes.

Take up the role of a clinical psychologist. Imagine that you see a patient described by the following case vignette.

“A 21-year-old university student living with her boyfriend self-refers with concerns about her 7-year use of laxatives to control weight gain. She is eating daily without vomiting, but admits to binge-eating episodes three or four times weekly during the past 2 years. Compensatory vomiting stopped 6 months ago. She does not overexercise. Her BMI is low at 17.8, and her vital signs are normal. She admits to recent increased fatigue with occasional exertional dyspnea and daily diarrhea. She has been hospitalized twice in the past 3 years for dehydration not recognized as related to her laxative abuse.”

Based on the information given, what would be your best estimate regarding the following questions that refer to the case vignette:

So even though originally the questions are meant as self-report, apply them as questions to be replied as observer and provide the respective best estimate regarding the following questions that refer to the case vignette:

[One of the 2 measures in their original format]

Reply to each question with the reply categories:

[Original reply categories of the measure]

If no estimate can be given for a question, code it as 999.

Provide the estimates as a simple table. In this table, provide each question as a new variable with the corresponding values in 2 columns, 1 column containing the question number in ascending order and 1 column containing ONLY the numerical values. Provide the entire table.
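The data generation procedure amounts to a fixed evaluation grid; the following minimal sketch illustrates how the total of 720 evaluations arises (labels are illustrative, not from the study):

```python
# Sketch of the evaluation grid: 30 source vignettes x 4 gender/orientation
# versions x 2 measures x 3 rounds = 720 vignette evaluations.
from itertools import product

VIGNETTES = range(1, 31)                         # 30 source vignettes
VERSIONS = ["F-het", "F-hom", "M-het", "M-hom"]  # 2 x 2 design
MEASURES = ["SF-36", "EDE-Q"]                    # the 2 psychometric instruments
ROUNDS = range(1, 4)                             # 3 repeated rounds

runs = list(product(VIGNETTES, VERSIONS, MEASURES, ROUNDS))
print(len(runs))  # 720
```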

Measures

RAND 36-Item Short Form Health Survey Version 1.0 (SF-36)

The SF-36 [] assesses HRQoL and consists of 8 subscales: physical functioning, bodily pain, role limitations due to physical health problems, role limitations due to personal or emotional problems, emotional well-being, social functioning, energy or fatigue, and general health perceptions. From these subscales, the mental composite summary (MCS; comprising role limitations due to personal or emotional problems, emotional well-being, social functioning, and energy or fatigue), as well as a physical composite score (PCS), can be calculated. Evidence suggests that in EDs, the MCS is more affected than the PCS []; thus, this score was selected for this study. Furthermore, the SF-36 includes a single item assessing perceived change in health, which is not included in any of the subscales. Items are answered either with “yes/no” or on different Likert scales and then recoded to values ranging from 0 to 100, with higher scores indicating better HRQoL. To calculate the MCS, the authors have suggested an approach [] in which, first, the subscales are z-transformed using means and SDs from the general US population; second, the subscales are aggregated by weighting them with coefficients from the general US population; and third, a t-score transformation is performed (mean 50, SD 10). This approach has been criticized for distorting the raw scores, and simply calculating the MCS as the mean of the 4 subscales was found to have satisfactory validity []. In this study, the simple approach was chosen because, on the one hand, only the MCS was investigated, and therefore a potential correlation with the PCS would not pose a problem. On the other hand, the choice of the population against which the scores are z-standardized and weighted makes assumptions about the origin of the data that ChatGPT-4 was trained on, which is not entirely known and could therefore distort our data.
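Under the simple approach, the MCS reduces to a mean of the 4 mental-health subscales. A minimal sketch, assuming subscale scores have already been recoded to the 0-100 range (names and values are illustrative, not from the study):

```python
# Simple MCS scoring sketch: mean of the 4 mental-health subscales
# (each already recoded to 0-100; higher = better HRQoL), rather than the
# population-weighted z-score/t-score approach.
MCS_SUBSCALES = [
    "role_emotional",       # role limitations due to personal or emotional problems
    "emotional_wellbeing",
    "social_functioning",
    "energy_fatigue",
]

def mcs_simple(subscale_scores: dict) -> float:
    """Mean of the four mental-health subscale scores."""
    return sum(subscale_scores[s] for s in MCS_SUBSCALES) / len(MCS_SUBSCALES)

example = {
    "role_emotional": 0.0,
    "emotional_wellbeing": 20.0,
    "social_functioning": 25.0,
    "energy_fatigue": 15.0,
}
print(mcs_simple(example))  # 15.0
```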

Eating Disorder Examination Questionnaire

The eating disorder examination questionnaire (EDE-Q) [] assesses ED symptomatology during the previous 28 days. It consists of 4 subscales: dietary restraint, weight concern, shape concern, and eating concern. By calculating the mean of these subscales, a global score can be formed. Items are answered on a scale ranging from 0 to 6, with 6 reflecting the greatest severity or frequency of ED symptoms.
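The EDE-Q scoring just described can be sketched as follows; note that the item groupings shown are illustrative only, as the real EDE-Q assigns specific items to each subscale:

```python
# EDE-Q scoring sketch: items rated 0-6 (6 = greatest severity or frequency);
# each subscale is the mean of its items, and the global score is the mean of
# the 4 subscale scores. Item groupings are illustrative.
def subscale_mean(items):
    return sum(items) / len(items)

def edeq_global(restraint, eating_concern, shape_concern, weight_concern):
    subscales = [restraint, eating_concern, shape_concern, weight_concern]
    return sum(subscale_mean(s) for s in subscales) / len(subscales)

print(edeq_global([6, 6], [5, 6], [6, 4], [5, 5]))  # 5.375
```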

Statistical Analysis

Data from the ChatGPT-4 and MentaLLaMA replies were copied to an Excel sheet, indicating the vignette number, gender, sexual orientation, and round number. Female gender and heterosexual orientation were coded as “0.” We performed all analyses in RStudio []. The data quality of the MentaLLaMA results was low and yielded no reliable results (). For the main outcome analyses of the ChatGPT-4 replies, we used the package “lme4” [], which is suitable for calculating linear multilevel models (MLMs) with a crossed random-effects structure []. This approach was chosen to take into account the repeated evaluation (3 rounds) of each vignette as well as the main and interaction effects of gender and sexual orientation. These MLMs included a random intercept for vignettes (accounting for between-vignette variance), as well as a random intercept for the gender × sexual orientation interaction nested in vignettes (accounting for within-vignette variance). This resulted in the formula:

Outcome ~ Gender × Orientation + (interaction(Gender, Orientation) | Vignette)

We plotted the results using ggplot2 [].


Results

Descriptives

Table 2 shows the unconditional means of the MCS and EDE-Q. For the SF-36, 1.19% of values were missing among the items included in the MCS. For the EDE-Q, 0.76% of values were missing among the items included in the overall score (coded “999” by ChatGPT-4 and recoded to missing values). Interrater reliability, measured by the intraclass correlation coefficient, was moderate for both measures (0.71 for the MCS and 0.56 for the EDE-Q).
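For readers unfamiliar with the metric, an intraclass correlation across the 3 rounds can be computed as in the following sketch. We assume a one-way random-effects formulation (ICC(1)) purely for illustration; the paper does not state which ICC variant was used:

```python
# Sketch of a one-way random-effects intraclass correlation, ICC(1), across
# repeated rounds. ratings: one list per vignette, each with k repeated scores.
# This specific ICC variant is an assumption for illustration.
def icc_oneway(ratings):
    n, k = len(ratings), len(ratings[0])
    target_means = [sum(r) / k for r in ratings]
    grand = sum(target_means) / n
    # Between-targets and within-target mean squares from one-way ANOVA.
    msb = k * sum((m - grand) ** 2 for m in target_means) / (n - 1)
    msw = sum((x - m) ** 2
              for r, m in zip(ratings, target_means) for x in r) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

print(icc_oneway([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))  # 1.0 (perfect agreement)
```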

Table 2. Means and SDs of the 2 outcome measures for each of the 4 subgroups.

Characteristics | MCS, mean (SD) | EDE-Q, mean (SD)
Female gender, overall (n=180) | 15.1 (15.6) | 5.61 (0.52)
Female gender, heterosexual (n=90) | 15.3 (16.3) | 5.63 (0.49)
Female gender, homosexual (n=90) | 14.8 (14.9) | 5.60 (0.55)
Male gender, overall (n=180) | 12.8 (14.2) | 5.65 (0.47)
Male gender, heterosexual (n=90) | 12.1 (12.5) | 5.64 (0.51)
Male gender, homosexual (n=90) | 13.6 (15.7) | 5.65 (0.42)

MCS: mental composite summary of the RAND 36-item short form survey; EDE-Q: eating disorder examination questionnaire.

Main Outcomes

For the MCS, the MLM with 360 observations indicated a significant effect of gender, with men having a lower MCS score (conditional means: 12.8 for male and 15.1 for female cases; 95% CI of the effect −6.15 to −0.35; Figure 1), and no indications of an effect of sexual orientation or of an interaction effect. For the EDE-Q overall score, there were no indications of significant main effects of gender (conditional means: 5.65 for male and 5.61 for female cases; 95% CI −0.10 to 0.14; P=.88) or sexual orientation (conditional means: 5.63 for heterosexual and 5.62 for homosexual cases; 95% CI −0.14 to 0.09; P=.67), or of an interaction effect (95% CI −0.11 to 0.19; P=.61). See Table 3 for estimates of the main and interaction effects with the respective P values and 95% CIs.

Figure 1. Lower HRQoL in men compared with women. HRQoL: health-related quality of life.

Table 3. Estimates calculated in the multilevel model.

Characteristics | MCS, estimate (P value), 95% CI | EDE-Q, estimate (P value), 95% CI
Gender | −3.25 (.04), −6.15 to −0.35 | −0.02 (.88), −0.10 to 0.14
Sexual orientation | −0.50 (.71), −3.04 to 2.05 | −0.03 (.67), −0.14 to 0.09
Gender × sexual orientation | 1.93 (.37), −2.18 to 6.04 | 0.04 (.61), −0.11 to 0.19

MCS: mental composite summary of the RAND 36-item short form survey; EDE-Q: eating disorder examination questionnaire.


Discussion

Principal Results

We investigated whether gender and sexual orientation in AN and BN case vignettes would influence mental HRQoL and ED severity estimates by ChatGPT-4, a commonly used LLM. Quadruples of 30 case vignettes from scientific papers were modified such that only information on gender and sexual orientation varied across vignettes of the same quadruple. Vignettes were then fed into ChatGPT-4 with the instruction to estimate scores on 2 widely used psychometric instruments assessing HRQoL (MCS of the SF-36) and ED symptomatology (EDE-Q). Findings indicated no effect of gender or sexual orientation on ED severity. Of note, the EDE-Q scores were very high, which might have led to ceiling effects. For the MCS, there was an effect of gender but not of sexual orientation, with vignettes describing men resulting in lower MCS scores than vignettes describing women. Thus, ChatGPT-4 assumed greater impairment in mental HRQoL for men than for women with similar ED severity. Since there is no evidence from previous studies that supports this finding, this can be considered a bias.

Interpretation

While the effect of gender was statistically significant, it is also important to consider the minimal clinically important difference (MCID), that is, to evaluate whether differences in scores would be clinically relevant []. For the MCS, the MCID has been estimated to be between 3 and 9 points [,]. With a difference of 2.3 points, the gender effect found in this study was slightly below the MCID. However, a longitudinal study showed that MCS scores in patients with EDs improved only 1-6 points during 2 years of treatment, although ED symptoms improved markedly, which highlights the clinical relevance of below-MCID differences in MCS scores in participants with EDs [].

Of note, the EDE-Q scores generated by ChatGPT-4 were around 1.6 points above the scores reported in ED samples [-]. Likewise, the MCS scores generated by ChatGPT-4 were around 20 points below the mean scores in other ED cohorts [,]. This has implications for the evaluation of the MCID, as potential floor effects need to be considered.

The gender bias exhibited by ChatGPT-4 could be due to social role expectations that men generally have fewer mental health problems than women, so that mental health problems in men, once identified, evoke more attention. Thus, ChatGPT-4 might mirror existing prejudices, which should be taken as a nudge to try to correct these prejudices in real life. In the field of EDs, the roles of gender and sexual orientation and the influence of stigmatization and biases in our society need to be better understood [,].

Strengths and Limitations

Our study has several strengths: First, real vignettes from scientific publications were used and varied in a way that the distinct influence of gender and sexual orientation could be singled out. To our knowledge, this is the first study that tests a potential bias when instructing an LLM to evaluate clinical cases with the use of psychometric instruments. Second, while many studies mentioned in this paper have used ChatGPT-3.5, we used ChatGPT-4, which has been shown to perform better in the field of mental health (). Furthermore, we attempted to repeat the analyses in MentaLLaMA, which is fine-tuned for the mental health domain. Third, by applying repeated testing, we reached a much larger sample size than other vignette studies, ensuring sufficient power for our analyses.

This study also has limitations. First, the gender ratio of the original vignettes was not balanced (only 2 male vignettes), which might have had an impact on the evaluation of these vignettes. However, this ratio approximately reflects the gender ratio of AN and BN in the general population. Second, although we sought to set the temperature to zero and followed available instructions to do so in the applied interface, we could not verify whether setting the temperature via “custom instructions” actually resulted in corresponding changes to the system’s temperature setting. Finally, the deviations in EDE-Q and MCS scores raise the question of whether scores generated by ChatGPT-4 can be transferred to scores reported in ED research and highlight that the use of LLMs for scoring patient vignettes is still in its fledgling stages.

Implications and Future Directions

Our findings highlight the importance of examining biases in LLMs in the context of (mental) health care. Future studies should investigate the generalizability of these findings by exploring biases in other LLMs as well as in other fields of (mental) health. As ChatGPT-4 has been found to disregard conditions that are understudied [], awareness of research and knowledge gaps, as well as of existing biases and stigma in society, is of high importance when using and training LLMs. Furthermore, potential mitigation strategies for biases introduced by LLMs should be investigated. Although AI is not yet widely used in the assessment of disorders, it is already used to assist doctors’ decision-making [,]. Furthermore, ChatGPT-3.5 has been used to generate more diverse and inclusive case vignettes for use in medical education []. It has been proposed that specially trained LLMs are needed in health care, as ChatGPT-4 was not intended to be used in a clinical context [] and was deemed unreliable in offering personalized medical advice [].

In an exploratory analysis, we attempted to replicate the analyses using MentaLLaMA, one of the very few available LLMs specialized for mental health topics with published scientific evidence []. However, MentaLLaMA is based on an older LLM and therefore appears to have difficulties conducting the kind of complex vignette assessments needed for this study. When using MentaLLaMA, our prompting strategy had to be adapted by creating a separate prompt for every single question. Still, MentaLLaMA yielded insufficient intraclass correlation coefficients. Thus, data quality was much lower than with the more recent and advanced model, GPT-4, on which our main analyses were based, leading to findings with low reliability and thus providing very limited insight ().

More powerful LLMs in the field of mental health need to be developed and validated, given that more recent publicly available models lack published evidence of their scientific validation []. When training specialized LLMs, policy makers should make sure that measures are taken to minimize biases in the training material and that proposed frameworks for responsible AI [] are considered. A potential next step could be to program LLMs or AI systems as “verifiers” to check for biases in specialized LLMs, using a similar methodology to that used in this study. This would establish an additional layer of scrutiny and validation, enhancing the reliability and fairness of LLM applications in mental health care. In a clinical context, it is important to understand the precision with which LLMs can interpret and apply information from case vignettes or patient records, compared with the accuracy achieved when affected patients complete these assessments themselves.

Conclusions

This study showed that ChatGPT-4 might exhibit a potential gender bias when evaluating ED symptomatology and mental HRQoL. Researchers as well as clinicians should be aware of potential biases when using LLMs to support clinical decision-making. Better understanding and mitigation of risk of bias related to gender and other factors, such as ethnicity or socioeconomic status, are highly warranted to ensure responsible use of LLMs.

R Schnepper is funded by the Swiss State Secretariat for Education, Research and Innovation (SERI, under funding number: 22.00094) in the context of a European Union (Horizon Europe) research consortium “Long Covid” (funding number: 101057553). The publication was funded and supported by the Open Access Fund of Universität Trier and by the German Research Foundation (DFG).

R Schnepper contributed to the conceptualization, methodology, and data collection; conducted the formal analysis; and wrote the original draft of the paper. NR contributed to the writing of the original draft. R Schaefert contributed to the conceptualization and manuscript review and editing. LL contributed to the conceptualization and manuscript review and editing. GM contributed to the conceptualization, methodology, data collection, formal analysis, writing the original draft, and manuscript review and editing. All authors read and approved the final submitted version of the paper.

R Schaefert and GM received funding from the Stanley Thomas Johnson Stiftung and Gottfried & Julia Bangerter-Rhyner-Stiftung under projects nos. PC 28/17 and PC 05/18, from Gesundheitsförderung Schweiz under project no. 18.191/K50001, and in the context of a Horizon Europe project from the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 22.00094 and from Wings Health in the context of a proof-of-concept study. GM received funding from the Swiss Heart Foundation under project no. FF21101, from the Research Foundation of the International Psychoanalytic University (IPU) Berlin under projects nos. 5087 and 5217, from the German Federal Ministry of Education and Research under budget item 68606, and from the Hasler Foundation under project no. 23004. GM is a cofounder, member of the board, and shareholder of Therayou AG, and active in digital and blended mental health care. GM receives royalties from publishing companies as author, including a book published by Springer, and an honorarium from Lundbeck for speaking at a symposium. Furthermore, GM is compensated for providing psychotherapy to patients, acting as a supervisor, serving as a self-experience facilitator ("Selbsterfahrungsleiter"), and for postgraduate training of psychotherapists, psychosomatic specialists, and supervisors. NR is a coworker at Therayou AG, active in digital and blended mental health care. NR received funding from the Hasler Foundation under project no. 23004 and from Wings Health AG in the context of a proof-of-concept study.

Edited by Oren Asman; submitted 01.03.24; peer-reviewed by Ahmed Hassan, Tianlin Zhang; final revised version received 30.10.24; accepted 24.11.24; published 20.03.25.

© Rebekka Schnepper, Noa Roemmel, Rainer Schaefert, Lena Lambrecht-Walzinger, Gunther Meinlschmidt. Originally published in JMIR Mental Health (https://mental.jmir.org), 20.3.2025.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on https://mental.jmir.org/, as well as this copyright and license information must be included.
