Large Language Models Versus Expert Clinicians in Crisis Prediction Among Telemental Health Patients: Comparative Study


Introduction

Background

Suicide is a serious public health concern. Suicide rates have risen alarmingly over the past 20 years, and in the United States, suicide is the second leading cause of death among adults aged 18-45 years []. In 2021, approximately 50,000 people in the United States died by suicide, marking the highest national rate of suicide in decades []. As suicide rates increase, the behavioral health care workforce in the United States has not expanded enough to meet these mental health demands, limiting the timely access to care that is essential for suicide risk detection and prevention [].

Suicide risk is difficult to predict. Research has identified numerous individual, relationship, community, and societal risk factors associated with suicide, such as a history of previous suicide attempts, psychiatric diagnosis, sense of hopelessness, social isolation, community violence, and access to lethal means of suicide [-]. More recently, suicide theories and research suggest ideation-to-action pathways to help explain suicide risk, whereby people who think about suicide are at higher risk of engaging in suicidal behavior [-].

Suicidal ideation (SI), defined as “thinking about, considering, or planning suicide” [], is common: 12.3 million Americans aged 18 years and older had thoughts of suicide in 2021 []. SI is predictive of suicide attempts and completed suicide [,], and it is a more sensitive predictor of lifetime risk for suicide than of imminent risk []. Research suggests that among those exhibiting SI, there is a 29% conditional probability of making a suicide attempt []. Other research has shown that those with nearly daily SI were 5 to 8 times more likely to attempt suicide and 3 to 11 times more likely to die by suicide within 30 days [].

Artificial intelligence (AI) methods have been used to assess mental health factors such as psychiatric symptom severity, diagnosis, and clinical risk from free text generated by the patient. Researchers using natural language processing (NLP) and machine learning have identified suicidal behavior from electronic medical records [] and detected SI in a variety of free-text settings []. In addition, an NLP-based system for determining the likelihood of crisis in patient chat messages to their clinicians was developed and implemented, with reliable retrospective and prospective performance, as a clinical support tool for a crisis specialist team [].

Recent advances in AI methods, such as large language models (LLMs), have also shown success in a variety of medical applications. Both generalist LLMs, such as generative pretrained transformer 4 (GPT-4), and medical domain–specific LLMs, such as Med-PaLM 2, have exhibited medical competency on benchmarks such as the United States Medical Licensing Examination (USMLE) [,]. Generalist LLMs can sometimes outperform domain-specific LLMs, as was recently found with GPT-4 outperforming Med-PaLM 2 on the MedQA medical benchmark []. Med-PaLM 2 has also been found to be effective at determining psychiatric functioning from free text, including patient-generated information during patient interviews [].

Objective

We seek to leverage the capabilities of LLMs to detect or predict SI with plan among patients enrolled in a national telemental health platform, using patient-generated free text collected at intake. We benchmark the performance of this LLM-based prediction against a cohort of expert senior mental health clinicians.


Methods

Overview

The study consisted of clinicians completing a digital questionnaire in which they were asked to predict whether a patient would endorse SI with a plan during the course of their treatment, based on patient-generated text describing their chief complaint. The same chief complaint texts were then served to the LLM GPT-4 with the same questionnaire instructions. The classification performance of the clinicians and GPT-4 was evaluated and compared.

Data Acquisition

The retrospective patient data used in this study were collected as part of the standard of care at Brightside Health and deidentified for research purposes. All patients treated at Brightside consent at intake to the terms of use and privacy policy that include consenting to Brightside’s use of their data for research purposes.

Inclusion Criteria

Data from patients who completed intake on the Brightside platform after March 15, 2023, and endorsed current SI (at intake) or subsequent SI (post intake, during the course of treatment) were included in the study set, along with a random cohort of patients treated during the same time frame who never endorsed SI with plan. To be included in the study sample, patients had to attend at least 1 psychiatric or therapy appointment and complete the chief complaint section of their digital intake form; patients who left this section empty were excluded.

Data and Outcome Variables

Patient-generated free text (the chief complaint) was extracted from patient intake as the answer to the question “In your own words, what are you feeling or experiencing?” and any personal identifiers (such as age, birthdate, name, location, email address, phone number, and social security number) within the free text were replaced with asterisks. Patient data extracted from intake also included age, gender identity, and history of previous suicide attempts. Clinicians and the LLM did not have access to patients’ age or gender identity; they were shown only the deidentified patient-generated free text and, subsequently, the patients’ self-reported history of suicide attempts.

SI with plan was determined from answers to question 9 of the Patient Health Questionnaire-9 (PHQ-9). The PHQ-9 is a self-report questionnaire consisting of 9 questions measuring depression symptom severity within the past 2 weeks, each scored from 0 to 3 (not at all, several days, more than half the days, and nearly every day, respectively), and it includes a specific question on the frequency of suicidal thoughts (item 9). If a patient endorses SI on the Brightside platform (item 9 answer value >0), a proprietary Brightside follow-up question asks whether the suicidal thoughts are something the patient has made specific plans for. At Brightside, the PHQ-9 is administered to all patients at intake and requested every 2 weeks during the course of treatment. PHQ-9 answers at intake and the date of the first SI with plan relative to intake were also extracted for this study.

Classification Label Definitions

Patients positive for SI with plan were defined as those who endorsed SI on the PHQ-9 at intake or at any point during the later course of treatment and subsequently responded that the SI was something they had made specific plans for. Patients negative for SI with plan were defined as those with no PHQ-9 item 9 values >0, that is, those who never endorsed SI in their PHQ-9 screenings.
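As an illustration, this labeling rule can be expressed as a short function. The following is a minimal sketch under hypothetical field names; Brightside’s actual data schema is not described in this paper.

```python
from typing import List, Optional

def si_with_plan_label(item9_scores: List[int], plan_flags: List[bool]) -> Optional[str]:
    """Label one patient from their PHQ-9 screenings (illustrative sketch).

    item9_scores: PHQ-9 item 9 answers (0-3), one per screening.
    plan_flags: for each screening with item 9 > 0, True if the patient
        reported having made specific plans (the proprietary follow-up).
    """
    if all(score == 0 for score in item9_scores):
        return "negative"  # never endorsed SI on any screening
    if any(plan_flags):
        return "positive"  # endorsed SI and reported a specific plan
    return None            # SI without plan: in neither study group
```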

Clinician Questionnaire Design

After the creation of the study data set, 6 clinicians employed at Brightside Health were recruited and all consented to participate in the study. Clinicians 1, 2, and 3 are senior psychiatrists with 18 to 30 years of clinical experience. Clinicians 4, 5, and 6 are senior psychologists with 10 to 23 years of clinical experience.

Each clinician was presented with the same questionnaire with the same randomized order of questions. Presented with a patient’s deidentified chief complaint, they were asked to answer the following questions: (1) do you expect this patient to endorse SI with a plan in the course of their treatment? (yes or no) and (2) how confident are you? (high, medium, or low).

The clinicians were then presented with additional information as to whether the patient endorsed a previous suicide attempt and reprompted with the same questions. All chief complaints and questions were displayed one at a time. Clinicians were given 2 weeks to complete the questionnaire with no other instructions or training and were not required to complete it in a single sitting.

Large Language Model Prompt Design

In this study, we evaluated OpenAI’s GPT-4 text-only model’s ability to predict whether a patient would endorse SI with plan during the course of treatment. All model results were produced in Python (Python Software Foundation) with OpenAI’s application programming interface and GPT-4 (“gpt-4-1106-preview” with default parameters) [].

First, the model context was constructed using step-by-step instructions and few-shot prompting with 3 real examples of chief complaints and their respective correct, expected answers (). Note that the examples chosen for the few-shot prompting were not included in the study data set and were redacted for publication.

Textbox 1. GPT-4 few-shot prompt template for predicting suicidal ideation with plan in chief complaint text.

You are a trained psychiatric professional. You will be presented with a patient’s answer to the question “In your own words, what are you feeling or experiencing?” also referred to as a patient’s chief complaint.

Your task is to determine based on a patient’s chief complaint, if that patient will endorse suicidal ideation with plan in the course of their treatment.

You will generate your answer as JSON:

Use the following step-by-step instructions to determine your answer:

Step 1. Answer in the key “at_risk” with only the values Yes or No.

Step 2. Answer in the key “confidence” with only the values High, Medium, or Low, how confident you are in your answer.

Step 3. Answer in the key “risk_indicators” with one or more values from the list of possible risk indicators below that determined your Yes or No answer.

Separate risk indicators with a “; ”.

If Other, specify your answer as “Other-<insert your answer>”. If no risk indicators are identified, answer with None.

List of possible risk indicators values:

Recent Stress, Loss, or Trauma

History of Trauma

Chronic medical conditions

Substance use

Previous suicide attempt

Lack or loss of relationships or support

Social isolation

Family history of suicide

Impulsive or aggressive language

Explicit mentions of suicide, suicidal thoughts, or self harm

Death imagery or metaphors

Apathy, indifference or emotional detachment

Sense of Hopelessness

Other

Here is an example of a chief complaint with a Yes to suicidal ideation with plan:

“<text redacted for publication> ”

Your answer would be:

Here is an example of a chief complaint with a No to suicidal ideation with plan: “<text redacted for publication>”

Your answer would be:

Here is an example of a chief complaint with a No to suicidal ideation with plan:

“<text redacted for publication>”

Your answer would be:

Next, the output format of the model was specified as JavaScript Object Notation (JSON) for ease of analysis. In addition to the prediction of SI with plan during the course of treatment, the model was asked to provide a confidence level (high, medium, or low) for the prediction (mirroring the clinicians’ questionnaire) and to provide reasoning from a list of explicitly provided risk indicators.
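For illustration, a well-formed model answer under this schema might look as follows. This is a hypothetical example; the answers used in the study’s few-shot examples were redacted for publication.

```json
{
  "at_risk": "Yes",
  "confidence": "Medium",
  "risk_indicators": "Sense of Hopelessness; Social isolation"
}
```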

Finally, the deidentified patient-generated chief complaint text was given to the model in the user prompt. Each chief complaint was provided independently, with the LLM reset to the original context between samples.

To evaluate the model’s performance when served the additional information of patients’ self-reported previous suicide attempts, the sentence “I have attempted suicide before” or “I have never attempted suicide before” was appended to the end of the chief complaint, which was then served as the prompt with the same context.
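A minimal sketch of this serving loop is shown below, using the OpenAI Python client. The study’s exact code was not published; the SYSTEM_PROMPT placeholder stands in for the full few-shot context of Textbox 1, and the function name is illustrative.

```python
import json
from typing import Optional

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
SYSTEM_PROMPT = "..."  # full few-shot instructions from Textbox 1

def predict(chief_complaint: str, prior_attempt: Optional[bool] = None) -> dict:
    text = chief_complaint
    if prior_attempt is not None:
        # Variant with previous suicide attempt knowledge appended
        text += (" I have attempted suicide before."
                 if prior_attempt
                 else " I have never attempted suicide before.")
    # Building a fresh messages list per call resets the model to the
    # original context for every chief complaint
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    # Assumes the model returns valid JSON, as instructed in the prompt
    return json.loads(response.choices[0].message.content)
```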

Performance Analysis

All analyses were performed in Python 3.8.12 with the package scikit-learn (version 1.3.1) []. Performance was compared on 2 tasks: patients positive for SI with plan at intake versus those negative for SI during the entire course of treatment, and patients positive for SI with plan post intake versus the same negative data set.

Classification and Predictive Performance

Clinician and model performance in predicting whether a chief complaint text sample was positive for SI with plan, both at intake and post intake, was evaluated for accuracy, sensitivity, specificity, and precision. Accuracy was defined as the proportion of correctly predicted samples over the total number of samples. Precision (or positive predictive value) was defined as the proportion of correctly predicted positive samples over the total number of predicted positive samples. Sensitivity was defined as the proportion of correctly predicted positive samples over the total number of positive samples. Specificity was defined as the proportion of correctly predicted negative samples over the total number of negative samples. As an additional baseline reference, previous suicide attempt information (yes or no) as a stand-alone predictor was also included in the evaluation.
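These definitions correspond to the standard confusion-matrix metrics. A minimal sketch of how they can be computed with scikit-learn, the package used for the analyses, follows; the helper name is illustrative.

```python
from sklearn.metrics import confusion_matrix

def performance(y_true, y_pred):
    # y_true, y_pred: 0 = no SI with plan, 1 = SI with plan
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # recall on positive samples
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),    # positive predictive value
    }
```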

Clinician and Large Language Model Agreement

To measure the agreement between the clinicians’ and GPT-4’s predictions, the Cohen κ statistic, which measures interrater agreement for categorical data, was calculated for each clinician–GPT-4 pairing.
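scikit-learn provides this statistic directly; a toy sketch with illustrative prediction vectors:

```python
from sklearn.metrics import cohen_kappa_score

# Toy example: 1 = predicts SI with plan, 0 = does not
gpt4_preds = [1, 0, 1, 1, 0, 0]
clinician_preds = [1, 0, 0, 1, 0, 1]
print(cohen_kappa_score(gpt4_preds, clinician_preds))  # agreement beyond chance
```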

Clinical Consensus and Confidence

Clinical consensus was defined as instances in which all clinicians answered with the same predicted outcome for a given sample, regardless of whether the prediction was correct. Rates of clinical consensus and rates of confidence were calculated to measure the variability and difficulty of clinical assessments on the given samples.
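As a sketch, consensus can be computed by checking that all 6 predictions for a sample are identical; the array layout here is an assumption.

```python
import numpy as np

# preds: shape (n_samples, 6); one column per clinician, values 0/1
def consensus_mask(preds: np.ndarray) -> np.ndarray:
    return preds.min(axis=1) == preds.max(axis=1)  # True where all 6 agree

# Rate of clinical consensus within a sample group:
# consensus_rate = consensus_mask(preds).mean()
```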

Accuracy of Clinical Consensus Influence on Large Language Model Performance

To measure the influence of the accuracy of clinical consensus on GPT-4 performance, subsets of chief complaint text samples where at least 1, 2, 3, 4, 5, or all 6 clinicians not only agreed but also correctly predicted the outcome for a given sample were evaluated for GPT-4 accuracy, sensitivity, specificity, and precision.
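A sketch of this subset evaluation, reusing the performance helper sketched above; the array names are hypothetical.

```python
import numpy as np

def gpt4_on_consensus_subset(y_true, gpt4_preds, clinician_preds, k):
    # Count, per sample, how many of the 6 clinicians predicted correctly
    correct_counts = (clinician_preds == y_true[:, None]).sum(axis=1)
    keep = correct_counts >= k  # samples meeting the accuracy threshold
    return performance(y_true[keep], gpt4_preds[keep])
```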

Risk Indicator Language and Clinician Performance

The GPT-4 prompt included a request to provide the rationale for its prediction from a list of explicitly provided risk indicators (). Clinician performance was then re-evaluated on patient chief complaints with no GPT-4–identified risk indicators as a way to understand how difficult these cases were to clinical experts.

Due to the generative nature of LLMs, GPT-4 occasionally produces an answer that is not in the list explicitly defined in the instructions. For the purpose of this analysis, only the following explicit risk indicators, assessed as exact string matches, were included: “recent stress, loss, or trauma,” “history of trauma,” “chronic medical conditions,” “substance use,” “previous suicide attempt,” “lack or loss of relationships or support,” “social isolation,” “family history of suicide,” “impulsive or aggressive language,” “explicit mentions of suicide, suicidal thoughts, or self-harm,” “death imagery or metaphors,” “apathy, indifference or emotional detachment,” and “sense of hopelessness.”
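A sketch of this exact-match filtering follows; lowercasing the strings before comparison is an assumption made for illustration.

```python
EXPLICIT_INDICATORS = {
    "recent stress, loss, or trauma", "history of trauma",
    "chronic medical conditions", "substance use",
    "previous suicide attempt", "lack or loss of relationships or support",
    "social isolation", "family history of suicide",
    "impulsive or aggressive language",
    "explicit mentions of suicide, suicidal thoughts, or self harm",
    "death imagery or metaphors",
    "apathy, indifference or emotional detachment",
    "sense of hopelessness",
}

def parse_indicators(answer: str) -> list:
    # GPT-4 separates indicators with "; "; keep only exact matches
    values = [v.strip().lower() for v in answer.split(";")]
    return [v for v in values if v in EXPLICIT_INDICATORS]
```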

Ethical Considerations

This study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of WCG (protocol 20240207).


Results

Overview

At the conclusion of the study (December 13, 2023), 260 patients met inclusion criteria and were positive for SI with plan. A total of 140 patients were positive for SI with plan at the time of intake and 120 patients were positive for SI with plan post intake in their subsequent treatment. A random subset of 200 patients was selected from those who met the inclusion criteria and were negative for SI with plan. A summary of the data can be found in .

Table 1. Summary of data for patients with no SI with plan (n=200), SI with plan indicated at intake (n=140), and SI with plan indicated post intake (n=120).
| Characteristic | No SI with plan (n=200) | SI with plan at intake (n=140) | SI with plan post intake (n=120) |
| --- | --- | --- | --- |
| Age (years), mean (95% CI) | 37.2 (35.7-38.9) | 34.4 (32.5-36.3) | 32.4 (30.3-34.5) |
| Gender identity, n (%) |  |  |  |
| Women | 135 (67.5) | 76 (54.3) | 59 (49.2) |
| Men | 64 (32) | 57 (40.7) | 59 (49.2) |
| Ethnicity, n (%) |  |  |  |
| White | 152 (76) | 94 (67.1) | 73 (60.8) |
| Hispanic | 16 (8) | 20 (14.3) | 14 (11.7) |
| Black | 13 (6.5) | 13 (9.3) | 16 (13.3) |
| Asian | 10 (5) | 6 (4.3) | 8 (6.7) |
| Other | 9 (4.5) | 7 (5) | 9 (7.5) |
| Chief complaint word count, mean (95% CI) | 49.6 (41.3-57.9) | 58 (33-83.1) | 57.2 (44.2-70.3) |
| Days between first SI with plan date and chief complaint, mean (95% CI) | —a | 0 (0) | 62.6 (52.4-72.8) |
| PHQ-9b total score at first SI with plan, mean (95% CI) | — | 21.1 (20.2-21.9) | 19.0 (17.8-20.2) |
| PHQ-9 item 9 score at first SI with plan, n (%) |  |  |  |
| 0 | — | 0 (0) | 0 (0) |
| 1 | — | 32 (22.9) | 34 (28.3) |
| 2 | — | 34 (24.3) | 29 (24.2) |
| 3 | — | 74 (52.9) | 57 (47.5) |
| With specific plan | — | 140 (100) | 120 (100) |
| PHQ-9 total score at intake, mean (95% CI) | 13.5 (12.7-14.2) | 20.9 (20.1-21.7) | 18.3 (17.2-19.4) |
| PHQ-9 item 9 score at intake, n (%) |  |  |  |
| 0 | 200 (100) | 0 (0) | 34 (28.3) |
| 1 | 0 (0) | 32 (22.9) | 34 (28.3) |
| 2 | 0 (0) | 34 (24.3) | 20 (16.7) |
| 3 | 0 (0) | 74 (52.9) | 32 (26.7) |
| With specific plan | 0 (0) | 140 (100) | 0 (0) |
| Previous suicide attempt, n (%) | 14 (7) | 55 (39.3) | 40 (33.3) |

aNot applicable.

bPHQ-9: Patient Health Questionnaire-9.

Prediction Performance

Predicting SI With Plan at Intake

The performance of previous suicide attempt alone as a predictor of SI with plan at the time of intake was similar to that of both GPT-4 and the clinicians, except for its low sensitivity of 0.39 ().

GPT-4 performed with similar accuracy (0.67) and higher sensitivity (0.62) in predicting SI with plan at the time of intake based on the chief complaint text only, as compared with the average accuracy (0.7) and sensitivity (0.53) across our 6 clinician participants (). However, GPT-4 performed with lower specificity (0.71) and precision (0.6) than the average clinician specificity (0.82) and precision (0.69). The interrater agreement between GPT-4 and each clinician was moderate as indicated by an average Cohen κ of 0.49.

Additional knowledge of the previous suicide attempt increased overall performance across clinicians (accuracy=0.75; sensitivity=0.59; specificity=0.86; precision=0.77). Additional knowledge of the previous suicide attempts significantly increased sensitivity for GPT-4 but decreased accuracy, specificity, and precision (accuracy=0.64; sensitivity=0.84; specificity=0.51; precision=0.54). The interrater agreement between GPT-4 and each clinician also decreased to an average Cohen κ of 0.39 with the additional information of the previous suicide attempts.

Table 2. Performance results for predicting suicidal ideation with a plan at the time of intake and predicting suicidal ideation with a plan in the future post intake based solely on chief complaint versus chief complaint plus knowledge of the previous attempt for GPT-4 and 6 clinicians. The performance of the previous suicide attempt alone as a predictor is included for baseline reference.
| Predictor | True negative, n | False positive, n | False negative, n | True positive, n | Accuracy | Sensitivity | Specificity | Precision | Cohen κ with GPT-4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SI with plan at intake (n=140) versus no SI with plan (n=200) |  |  |  |  |  |  |  |  |  |
| Baseline for comparison: previous suicide attempt only | 186 | 14 | 85 | 55 | 0.71 | 0.39 | 0.93 | 0.8 | —a |
| Chief complaint text only |  |  |  |  |  |  |  |  |  |
| GPT-4 | 141 | 59 | 53 | 87 | 0.67 | 0.62 | 0.71 | 0.6 | — |
| Clinician 1 | 160 | 40 | 58 | 82 | 0.71 | 0.59 | 0.8 | 0.67 | 0.53 |
| Clinician 2 | 189 | 11 | 95 | 45 | 0.69 | 0.32 | 0.95 | 0.80 | 0.36 |
| Clinician 3 | 138 | 62 | 48 | 92 | 0.68 | 0.66 | 0.69 | 0.6 | 0.56 |
| Clinician 4 | 183 | 17 | 85 | 55 | 0.7 | 0.39 | 0.92 | 0.76 | 0.44 |
| Clinician 5 | 162 | 38 | 58 | 82 | 0.72 | 0.59 | 0.81 | 0.68 | 0.5 |
| Clinician 6 | 156 | 44 | 52 | 88 | 0.72 | 0.63 | 0.78 | 0.67 | 0.54 |
| Average across clinicians | — | — | — | — | 0.70 | 0.53 | 0.82 | 0.7 | 0.49 |
| Chief complaint text + previous suicide attempt knowledge |  |  |  |  |  |  |  |  |  |
| GPT-4 | 102 | 98 | 23 | 117 | 0.64 | 0.84 | 0.51 | 0.54 | — |
| Clinician 1 | 163 | 37 | 49 | 91 | 0.75 | 0.65 | 0.82 | 0.71 | 0.46 |
| Clinician 2 | 194 | 6 | 89 | 51 | 0.72 | 0.36 | 0.97 | 0.9 | 0.21 |
| Clinician 3 | 152 | 48 | 39 | 101 | 0.74 | 0.72 | 0.76 | 0.68 | 0.5 |
| Clinician 4 | 187 | 13 | 67 | 73 | 0.77 | 0.52 | 0.94 | 0.85 | 0.32 |
| Clinician 5 | 173 | 27 | 53 | 87 | 0.77 | 0.62 | 0.87 | 0.76 | 0.4 |
| Clinician 6 | 159 | 41 | 47 | 93 | 0.74 | 0.66 | 0.8 | 0.69 | 0.42 |
| Average across clinicians | — | — | — | — | 0.75 | 0.59 | 0.86 | 0.77 | 0.39 |
| SI with plan post intake (n=120) versus no SI with plan (n=200) |  |  |  |  |  |  |  |  |  |
| Baseline for comparison: previous suicide attempt only | 186 | 14 | 80 | 40 | 0.71 | 0.33 | 0.93 | 0.74 | — |
| Chief complaint text only |  |  |  |  |  |  |  |  |  |
| GPT-4 | 141 | — | 65 | 55 | 0.61 | 0.46 | 0.71 | 0.48 | — |
| Clinician 1 | 160 | — | 69 | 51 | 0.66 | 0.43 | 0.8 | 0.56 | 0.44 |
| Clinician 2 | 189 | — | 100 | 20 | 0.65 | 0.17 | 0.95 | 0.65 | 0.26 |
| Clinician 3 | 138 | — | 54 | 66 | 0.64 | 0.55 | 0.69 | 0.52 | 0.44 |
| Clinician 4 | 183 | — | 84 | 36 | 0.68 | 0.3 | 0.92 | 0.68 | 0.34 |
| Clinician 5 | 162 | — | 70 | 50 | 0.66 | 0.42 | 0.81 | 0.57 | 0.43 |
| Clinician 6 | 156 | — | 56 | 64 | 0.69 | 0.53 | 0.78 | 0.59 | 0.50 |
| Average across clinicians | — | — | — | — | 0.66 | 0.4 | 0.82 | 0.59 | 0.4 |
| Chief complaint text + previous suicide attempt knowledge |  |  |  |  |  |  |  |  |  |
| GPT-4 | 102 | — | 31 | 89 | 0.6 | 0.74 | 0.51 | 0.48 | — |
| Clinician 1 | 163 | — | 59 | 61 | 0.7 | 0.51 | 0.82 | 0.62 | 0.37 |
| Clinician 2 | 194 | — | 90 | 30 | 0.7 | 0.25 | 0.97 | 0.83 | 0.17 |
| Clinician 3 | 152 | — | 49 | 71 | 0.7 | 0.59 | 0.76 | 0.6 | 0.45 |
| Clinician 4 | 187 | — | 76 | 44 | 0.72 | 0.37 | 0.94 | 0.77 | 0.27 |
| Clinician 5 | 173 | — | 63 | 57 | 0.72 | 0.48 | 0.87 | 0.68 | 0.36 |
| Clinician 6 | 159 | — | 54 | 66 | 0.7 | 0.55 | 0.8 | 0.62 | 0.35 |
| Average across clinicians | — | — | — | — | 0.71 | 0.46 | 0.86 | 0.69 | 0.33 |

aNot applicable.

Predicting SI With Plan Post Intake

Performance decreased for both clinicians and GPT-4 when predicting future SI with plan post intake. Note that specificity results were consistent with predicting SI with plan at intake, as there was no change in the negative samples.

GPT-4 performed with similar accuracy (0.61) and higher, but still poor, sensitivity (0.46) in predicting SI with plan post intake based solely on the chief complaint compared with the average accuracy (0.66) and sensitivity (0.4) across the 6 clinicians (). GPT-4 performed with lower precision (0.48) than the average clinician precision (0.59). The interrater agreement between GPT-4 and each clinician remained moderate at an average Cohen κ of 0.4.

Additional knowledge of the previous suicide attempts increased performance across all clinicians (accuracy=0.71; sensitivity=0.46; precision=0.69). Additional knowledge of the previous suicide attempt significantly increased sensitivity for GPT-4 but decreased accuracy and precision (accuracy=0.6; sensitivity=0.74; precision=0.48). The interrater agreement between GPT-4 and each clinician was lower, with an average Cohen κ of 0.33 with the additional information.

Clinical Consensus and Confidence

Clinical consensus was defined as instances in which all 6 clinicians agreed on the predicted outcome for a given sample, regardless of whether the prediction was correct. Clinical consensus occurred in 52% (104/200) of “no SI with plan” samples, 40.7% (57/140) of “SI with plan at intake” samples, and 40% (48/120) of “SI with plan postintake” samples (). For SI with plan samples with a clinical consensus, the agreed-upon prediction was correct 61.4% (35/57) of the time for “SI with plan at intake” versus much lower at 25% (12/48) for “SI with plan postintake.” For the “no SI with plan” samples with a clinical consensus, the clinicians’ agreed-upon prediction was correct at a high rate of 98.1% (102/104).

Table 3. Rates of clinical consensus, defined as instances in which all 6 clinicians agreed on the predicted outcome for a given sample. Percentages in the first row are of all samples in each group; percentages in the remaining rows are of the consensus samples.

|  | No SI with plan (n=200), n (%) | SI with plan at intake (n=140), n (%) | SI with plan post intake (n=120), n (%) |
| --- | --- | --- | --- |
| Samples with clinical consensus | 104 (52) | 57 (40.7) | 48 (40) |
| Clinical consensus predicted SI with plan | 2 (1.9) | 35 (61.4) | 12 (25) |
| Clinical consensus predicted no SI with plan | 102 (98.1) | 22 (38.6) | 36 (75) |

In addition, clinicians, on average, had lower rates of high confidence answers (even when the answers were correct) than GPT-4 (). On average, clinicians correctly answered “no” with high confidence in 9.5% (19/200) of “no SI with plan” samples, versus 35% (70/200) for GPT-4. Clinicians correctly answered “yes” with high confidence in 15.7% (22/140) of “SI with plan at intake” samples versus 29.3% (41/140) for GPT-4. Rates of correct “yes with high confidence” answers were lower in “SI with plan postintake” samples but remained higher for GPT-4 than the clinician average (13.3%, 16/120 vs 7.2%, 8.7/120).

Table 4. Rates of high confidence answers. Cell values are the number (%) of samples in each group answered “yes” or “no” with high confidence.

| Rater | No SIa with plan (n=200): yes, n (%) | No SI with plan: no, n (%) | SI with plan at intake (n=140): yes, n (%) | At intake: no, n (%) | SI with plan post intake (n=120): yes, n (%) | Post intake: no, n (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Clinician 1 | 5 (2.5) | 6 (3) | 45 (32.1) | 1 (0.7) | 16 (13.3) | 2 (1.7) |
| Clinician 2 | 0 (0) | 19 (9.5) | 5 (3.6) | 7 (5.0) | 1 (0.8) | 4 (3.3) |
| Clinician 3 | 2 (1) | 41 (20.5) | 20 (14.3) | 9 (6.4) | 9 (7.5) | 6 (5) |
| Clinician 4 | 0 (0) | 0 (0) | 1 (0.7) | 0 (0) | 0 (0) | 0 (0) |
| Clinician 5 | 0 (0) | 2 (1) | 23 (16.4) | 0 (0) | 5 (4.2) | 3 (2.5) |
| Clinician 6 | 2 (1) | 46 (23) | 38 (27.1) | 13 (9.3) | 21 (17.5) | 12 (10) |
| Average across clinicians | 1.5 (0.75) | 19 (9.5) | 22 (15.7) | 5 (3.6) | 8.7 (7.2) | 4.5 (3.8) |
| GPT-4b | 1 (0.5) | 70 (35.0) | 41 (29.3) | 17 (12.1) | 16 (13.3) | 14 (11.7) |

aSI: suicidal ideation.

bGPT-4: generative pretrained transformer 4.

Accuracy of Clinical Consensus and GPT-4 Performance

Accurate clinical consensus thresholds were defined over samples where a given number of clinicians, ranging from at least 1 to all 6, correctly predicted the outcome for a given sample. There were 316 “SI with plan at intake” and “no SI with plan” samples where at least 1 clinician predicted the outcome correctly versus 137 samples where all 6 clinicians predicted the outcome correctly (). There were 282 “SI with plan postintake” and “no SI with plan” samples where at least 1 clinician predicted the outcome correctly versus 114 samples where all 6 clinicians predicted the outcome correctly.

Table 5. Performance results for GPT-4, based solely on the chief complaint, in samples where at least 1, 2, 3, 4, 5, or all 6 clinicians correctly predicted the outcome of those samples.

| Number of clinicians correctly predicting the sample (consensus threshold) | Number of samples | True negative, n | False positive, n | False negative, n | True positive, n | Accuracy | Sensitivity | Specificity | Precision |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SI with plan at intake (original n=140) versus no SI with plan (original n=200) |  |  |  |  |  |  |  |  |  |
| ≥1 | 316 | 141 | 57 | 32 | 86 | 0.72 | 0.73 | 0.71 | 0.60 |
| ≥2 | 284 | 141 | 52 | 14 | 77 | 0.77 | 0.85 | 0.73 | 0.60 |
| ≥3 | 259 | 137 | 42 | 7 | 73 | 0.81 | 0.91 | 0.77 | 0.64 |
| ≥4 | 236 | 133 | 36 | 2 | 65 | 0.84 | 0.97 | 0.79 | 0.64 |
| ≥5 | 200 | 123 | 24 | 0 | 53 | 0.88 | 1 | 0.84 | 0.69 |
| 6 | 137 | 89 | 13 | 0 | 35 | 0.91 | 1 | 0.87 | 0.73 |
| SI with plan post intake (original n=120) versus no SI with plan (original n=200) |  |  |  |  |  |  |  |  |  |
| ≥1 | 282 | 141 | 57 | 31 | 53 | 0.69 | 0.63 | 0.71 | 0.48 |
| ≥2 | 266 | 141 | 52 | 23 | 50 | 0.72 | 0.69 | 0.73 | 0.49 |
| ≥3 | 233 | 137 | 42 | 10 | 44 | 0.78 | 0.82 | 0.77 | 0.51 |
| ≥4 | 211 | 133 | 36 | 6 | 36 | 0.80 | 0.86 | 0.79 | 0.5 |
| ≥5 | 169 | 123 | 24 | 1 | 21 | 0.85 | 0.96 | 0.84 | 0.47 |
| 6 | 114 | 89 | 13 | 0 | 12 | 0.89 | 1 | 0.87 | 0.48 |

As the accurate clinical consensus threshold increased, GPT-4 performance increased significantly in those samples (). When assessing the “SI with plan at intake” and “no SI with plan” samples with a clinical consensus of 3 or more and correct predictions, GPT-4 performed with an accuracy of 0.81, sensitivity of 0.91, specificity of 0.77, and precision of 0.64. When assessing the “SI with plan postintake” and “no SI with plan” samples with a clinical consensus of 3 or more and correct predictions, GPT-4 performed with an accuracy of 0.80, sensitivity of 0.86, and precision of 0.51.

Risk Indicators Identified in Chief Complaint Text by GPT-4

At least 1 risk indicator was identified in the chief complaint text by GPT-4 in 45.5% (91/200) of “no SI with plan” samples (). A total of 70% (98/140) of “SI with plan at intake” samples and 54.2% (65/120) of “SI with plan postintake” samples had at least 1 GPT-4–identified risk indicator. The most common risk indicator in “SI with plan at intake” samples identified by GPT-4 was “sense of hopelessness” (in 40% [56/140] of samples, compared with 27.5% [33/120] of “SI with plan postintake” and 16.5% [33/200] of “no SI with plan” samples). The most common risk indicator in “no SI with plan” samples was “recent stress, loss, or trauma” (in 25.5% [51/200] of samples, compared with 22.1% [31/140] of “SI with plan at intake” samples and 17.5% [21/120] of “SI with plan postintake” samples). In addition, the rate of identification of “social isolation” as a risk factor in “SI with plan postintake” samples (15/120, 12.5%) was higher than in both “no SI with plan” samples (13/200, 6.5%) and “SI with plan at intake” samples (8/140, 5.7%).

Table 6. Number of samples per explicit risk indicator identified by GPT-4.
| Number of risk indicators identified by GPT-4 | No SI with plan (n=200), n (%) | SI with plan at intake (n=140), n (%) | SI with plan post intake (n=120), n (%) |
| --- | --- | --- | --- |
| 0 | 109 (54.5) | 42 (30) | 55 (45.8) |
| 1 | 34 (17) | 28 (20) | 22 (18.3) |
| 2 | 34 (17) | 37 (26.4) | 18 (15) |
| 3 | 16 (8) | 22 (15.7) | 15 (12.5) |
| 4 | 4 (2) | 6 (4.3) | 8 (6.7) |
| 5 | 3 (1.5) | 3 (2.1) | 1 (0.8) |
| 6 | 0 (0) | 2 (1.4) | 1 (0.8) |
