Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: Benchmark Study

IntroductionBackground

Counseling refers to a relationship between a professional counselor and individuals, families, or other groups that empowers the clients to achieve mental health, wellness, education, and career goals. Specifically, in individuals with psychological or interpersonal difficulties, mental health counseling may be seen as a key helping intervention. Counseling sessions embrace a client-centered approach, fostering an environment of trust and exploration. These sessions delve deep into personal experiences, where clients share intimate details while therapists navigate the dialogue to cultivate a safe and supportive space for healing. Discussions within these sessions span a wide range of topics, from recent life events to profound introspections, all of which contribute to the therapeutic journey. An important aspect of the counseling process lies in the documentation of counseling notes (summary of the entire session), which is essential for summarizing client stressors and therapy principles. Session notes are pivotal in tracking progress and in guiding future sessions. However, capturing the intricacies of these conversations poses a formidable challenge, demanding training, expertise, and experience of mental health professionals. These summaries distill key insights, including symptom and history (SH) exploration, patient discovery (PD), and reflection, while filtering out nonessential details. However, the need for meticulous recordkeeping can sometimes detract from the primary focus of therapy. Maintaining a seamless flow of conversation is paramount in effective therapy, where any disruption can impede progress. To streamline this process and ensure continuity, automation emerges as a promising solution for the counseling summarization task. While advances in artificial intelligence (AI) have revolutionized document summarization, the application of these technologies to mental health counseling remains relatively unexplored.

Previous studies [-] have recognized the potential of counseling summarization in optimizing therapeutic outcomes. However, existing models often overlook the unique nuances inherent in mental health interactions. Standard counseling dialogues, using reflective listening, involve identifying current issues; developing a biopsychosocial conceptualization, including past traumas and coping strategies; and chalking out treatment plans. The counseling dialogues also include discussion on between-session issues as well as crises, if any. An effective counseling summary should selectively capture information pertinent to each of these categories while eliminating extraneous details.

Despite the demonstrated capabilities of large language models (LLMs) in various domains, research in mental health counseling summarization is scarce. One major obstacle is the lack of specialized data sets tailored to counseling contexts. To bridge this gap, we embarked on a two-pronged approach: (1) creating a novel counseling-component–guided summarization data set, called Mental Health Counseling-Component–Guided Dialogue Summaries (MentalCLOUDS); and (2) evaluating state-of-the-art LLMs on the task of counseling-component–guided summarization. Through these efforts, we aim to propel the integration of AI technologies into mental health practice, ultimately enhancing the quality and accessibility of therapeutic interventions.

Related WorkOverview

Summarizing counseling conversations enhances session continuity and facilitates the development of comprehensive therapy plans. However, analyzing these interactions manually is an arduous task. To address this challenge, advances in AI and natural language processing, particularly in summarization techniques, offer a promising solution. Summarization tasks can be approached via an extractive [] or an abstractive [] viewpoint. Extractive summarization involves identifying the most relevant sentences from an article and systematically organizing them. Given the simplicity of the approach, the resultant extractive summaries are often less fluent. By contrast, abstractive summarization extracts important aspects of a text and generates more coherent summaries. By using summarization, therapists can access recaps of sessions, sparing them the need to sift through lengthy dialogues. While summarization has been a long-studied problem in natural language processing [], recent attention has shifted toward aspect-based summarization, a method that focuses on generating summaries pivoted on specific points of interest within documents.

Chen and Verma [] proposed a retrieval-based medical document summarization approach in which the user query is fine-tuned using a medical ontology, but their method is limited due to its overall primitive design. Konovalov et al [] highlight the importance of identifying emotional reactions and “early counseling” components. Strauss et al [] used machine learning approaches to automate the analysis of clinical forms, and they envision using machine learning in mental health to a certain extent. Furthermore, research on major depressive disorder [] underscores the significance of identifying crucial indicators from patient conversations, such as age, anxiety levels, and long episode duration, in the choice of the appropriate level of antidepressant medication, guiding subsequent sessions and prescriptions. Subsequently, the effectiveness of the prescribed antidepressants is monitored to assess the patient’s response.

This concept identifies crucial indicators from the patient’s conversations with the therapist and guides subsequent follow-up sessions based on the patient’s history of interactions and prescriptions. Deep learning approaches, such as the use of recurrent neural networks and long short-term memory, have been used to predict 13 predefined mental illnesses based on neuropsychiatric notes that contain 300 words each, on average, about the patient’s present illness and events associated with it, followed by a psychiatric review system that mentions the mental illness related to the patient []. Chen et al [] proposed an extractive summarization approach using the Bidirectional Encoder Representations from Transformers (BERT) model [] to reduce physicians’ efforts in analyzing tedious amounts of diagnosis reports. However, there remains a notable gap in effectively capturing medical information in session summaries.

In addition, some contemporary works used authentic mental health records to create synthetic data sets []. Afzal et al [] reported the summarization of medical documents to identify PICO (Population, Intervention, Comparison, and Outcomes) elements. Manas et al [] proposed an unsupervised abstractive summarization in which domain knowledge from the Patient Health Questionnaire-9 was used to build knowledge graphs to filter relevant utterances. A 2-step summarization was devised by Zhang et al [] wherein partial summaries were initially consolidated, and the final summary was generated by fusing these chunks. Furthermore, Zafari and Zulkernine [] demonstrated a web-based application built using information extraction and annotation tailored to the medical domain.

For dialogue summarization, abstractive summarization has been the de facto standard due to its ability to capture critical points coherently. Nallapati et al [] used an encoder-decoder–based abstractive summarization method, which was further improved via the attention mechanism []. Subsequently, See et al [] introduced a hybrid approach of extractive and abstractive summarization. Chen and Bansal [] proposed a reinforcement learning-based approach as a mixture of extractive and abstractive approaches for summarization wherein emphasis is given to redundancy reduction in the utterances extracted from the conversation. Recent research reveals the dependence of specific utterances in the extraction of salient sentences from the conversation utterances. In this regard, Narayan et al [] analyzed topic distribution based on latent Dirichlet allocation []. Subsequently, Song et al [] segregated utterances into 3 labels: problem description, diagnosis, and other. In medical counseling, Quiroz et al [] and Krishna et al [] adopted the method of selecting significant utterances for summarizing medical conversations.

In aspect-based summarization, instead of an overall summary of the entire document, summaries at different aspect levels are made based on specific points of interest. These aspects could be movie reviews [-] or summarization guided by different domains [,] where the documents or the segments of the documents are tagged with these aspects. Hayashi et al [] released a benchmarking data set on multidomain aspect-based summarization where they annotated 20 different domains as aspects using the section titles and boundaries of each article chosen from Wikipedia. Frermann et al [] reported an aspect-based summarization of the news domain. Their method can segment documents by aspect, and the model can generalize from the synthetic data to natural documents. The study further revealed the models’ efficacy in summarizing long documents. Recently, aspect-based summarization has garnered considerable traction; however, the data set is limited. Yang et al [] released a large-scale, high-quality data set on aspect-based summarization from Wikipedia. The data set contains approximately 3.7 million instances covering approximately 1 million aspects sourced from 2 million Wikipedia pages. Apart from releasing the data set, the authors also benchmarked it on the Longformer-Encoder-Decoder [] model where they performed zero-shot, few-shot, and fine-tuning on 7 downstream domains where data are scarce. Joshi et al [] address the general summarization of medical dialogues. They proposed combining extractive and abstractive methods that leverage the independent and distinctive local structures formed during a patient’s medical history compilation. Liu et al [] reported a topic-based summarization of general medical domains pertaining to topics such as swelling, headache, chest pain, and dizziness. Their encoder-decoder model tries to generate 1 symptom (topic) at a time. Besides, work on formalizing the conversation text has been reported in the study by Kazi and Kahanda []. This work treats the formalization of the case notes from digital transcripts of physician-patient conversations as a summarization task. The method involves 2 steps: prediction of the electronic health record categories and formal text generation. Gundogdu et al [] used a BERT-based sequence-to-sequence model for summarizing clinical radiology reports. The experimental results indicated that at least 76% of their summary generations were as accurate as those generated by radiologists. There is also a report on topic-guided dialogue summarization for clinical physician-patient conversations []. The approach first learns the topic structure of the dialogues and uses these topics to generate the summaries in the desired format (eg, the subjective, objective, assessment, and plan format). Zhang et al [] proposed a method for factually consistent summarization of clinical dialogues. This method involves extracting factual statements and encoding them into the dialogue. In addition, a dialogue segmenter is trained to segment the dialogues based on topic switching, which enhances the model’s overall discourse awareness. Chintagunta et al [] used GPT-3 [] to generate training examples for medical dialogue summarization tasks. Recently, there have been reports of LLMs being used in medical dialogue summarization to expedite diagnosis by focusing on relevant medical facts, thereby reducing screening time []. The authors conducted benchmarking on GPT-3.5, Bidirectional and Auto-Regressive Transformer (BART) [], and BERT for Summarization []. The study indicated that GPT-3.5 generated more accurate and human-aligned responses than the other 2 models. Another study [] demonstrated the effectiveness of LLMs in clinical text summarization across 4 different tasks: physician-patient dialogue, radiology reports, patient questions, and progress notes. The quantitative analysis revealed that the summaries generated by the adapted LLMs were comparable, or even superior, in quality to those of the human experts in terms of conciseness, correctness, and completeness. Singh et al [] used open-source LLMs to extract and summarize suicide ideation indicators from social media texts to expedite mental health interventions.

Opportunities

The aforementioned previous works either did not focus on aspect-based summarization or reported on general clinical discussions of common symptoms and conditions (eg, cough, cold, and fever). However, there are still avenues to be explored in the aspect-based summarization of mental health therapy conversations, considering that mental health is a pressing global issue requiring urgent consideration. These therapy conversations encompass several counseling components, including patient information, past symptoms, diagnosis history, reflection, and the therapist’s action plans. Focusing the summaries on these counseling components would facilitate targeted and focused summaries, significantly reducing the time and effort and leading to more effective therapy overall. In this direction, our work is motivated by the study conducted by Srivastava et al [], which reported on a summarization-based counseling technique from therapist-client conversations. They released a conversation data set that is structured with the core components of psychotherapy about SH identification or the discovery of the patient’s behavior. The authors proposed an encoder-decoder model based on Text-to-Text Transfer Transformer (T5) [] for their counseling-component–guided summarization model. However, a single, generic summary is generated in the work, and no focus is given to generating aspect-based summaries. Consequently, we extended the work by using the counseling components, namely SH exploration, PD, and reflection, into an aspect-based summarization framework. To this end, we created MentalCLOUDS, a data set that incorporates summaries aligned with the distinct counseling components. We also explored the efficacy of the state-of-the-art LLMs (encoder-decoder as well as decoder-only models) for the summarization of counseling dialogues in this work.

Taxonomy

On the basis of the survey of related works on summarization in the medical domain in general and in mental health in particular, we present a taxonomy of task formulations for summarization tasks in the medical domain ( [,,-,,,,,,,-]). In general, medical text summarization is divided into research articles [-], reports, patient health questions, electronic health records, and dialogue summarization. Report summarization encompasses the summarization of reports, such as impressions or summarizations of radiology findings [,,-]. Patient health question summarization involves summarizing informal, nontechnical, and lengthy patient questions into technically sound and concise ones [-]. Electronic health record summarization includes the summarization of patient notes such as clinical progress notes [-] and discharge notes [,,-]. Our work focuses on the abstractive dialogue summarization of mental health counseling conversations, specifically targeting the counseling aspects. In addition, the survey includes general medical dialogue summarization [-,,,,] and mental health dialogue summarization [,,,]. Of note, this taxonomy does not represent the global scenario but rather provides a comprehensive depiction based on the aforementioned survey.

‎

Figure 1. Taxonomy of summarization methods in the medical domain. Challenges

Mental health counseling conversations often involve sensitive and confidential information. There is an expectation of empathetic and reflective responses from the therapist and action plans based on which the therapy is conducted. Generative AI–based counselors are susceptible to generating insensitive or incorrect suggestions and lacking empathy in their responses, which can negatively impact the therapy process. Moreover, the components or aspects of counseling sessions are subjective, and a counseling conversation can have multiple aspects. Therefore, the scope of the aspect-based summarization is limited to the specific annotated aspects. However, annotating these aspects requires expert manual intervention, which is costly both in terms of human resources and the financial perspective.

MethodsOverview of the Proposed Data Set: MentalCLOUDS

To evaluate the performance of diverse summarization systems across various aspects of counseling interactions, we expanded upon the Mental Health Summarization (MEMO) data set []. Comprising 11,543 utterances extracted from 191 counseling sessions involving therapists and patients, this data set draws from publicly accessible platforms such as YouTube. Embracing a heterogeneous demographic spectrum with distinctive mental health concerns and diverse therapists, the data set facilitates the formulation of a comprehensive and inclusive approach for researchers. Using preprocessed transcriptions derived from counseling videos, the constituent dialogues within the data set exhibit a dyadic structure, exclusively featuring patients and therapists as interlocutors. Within each conversation, 3 pivotal counseling components (aspects) emerge: SH exploration, PD, and reflective utterances.

Our study aims to capture the essence of each aforementioned counseling component, embarking on the creation of 3 distinct summaries for a single dialogue, with each summary tailored to a specific counseling component. Expanding upon the MEMO data set, we augmented it with annotated dialogue summaries corresponding to the 3 identified components. Collaborating closely with a team of leading mental health experts (for their details, refer to the Qualitative Assessment by Experts subsection), we crafted annotation guidelines and subjected the summary annotations to rigorous validation processes. We call the resultant data set MentalCLOUDS. We highlight its key statistics in and .

Table 1. Statistics of the Mental Health Counseling-Component–Guided Dialogue Summaries data set.SetDialogues (n=191), n (%)Utterances (n=11,543), n (%)Utterances per dialogue, mean (SD)Patient utterances (n=5722), n (%)Therapist utterances (n=5814), n (%)SHa utterances (n=2379), n (%)PDb utterances (5428), n (%)Reflective utterances (n=1242), n (%)Training131 (68.59)8342 (72.3)63.68 (38.44)4124 (72.1)4211 (72.4)1882 (79.1)3826 (70.5)884 (71.2)Validation21 (10.99)1191 (10.3)56.71 (27.06)594 (10.4)597 (10.3)206 (8.7)445 (8.2)146 (11.8)Test39 (20.42)2010 (17.4)51.53 (39.96)1004 (17.5)1006 (17.3)291 (12.2)1157 (21.3)212 (17.1)

aSH: symptom and history.

bPD: patient discovery.

‎

Figure 2. Distribution of summary lengths in the Mental Health Counseling-Component–Guided Dialogue Summaries (MentalCLOUDS) data set. Data Annotation ProcessGuidelines

Conversations in counseling situations can be challenging, given the sensitive nature of the information shared. A therapist’s reflective and open attitude can facilitate this expression. This dynamic is reinforced by the proposed MentalCLOUDS data set. This data set distinguishes the utterances dedicated to symptom exploration, discovering the history of mental health issues and patient behavior, as well as providing insights into past narratives, thereby shaping the patient’s present circumstances. These nuanced elements form the core of our identified counseling components. To improve the richness of the data set, we collaborated with mental health experts to formulate a set of annotation guidelines []. Furthermore, these guidelines serve as a comprehensive framework by which annotators can focus their attention on particular aspects of the conversation that are essential for producing summaries that are customized for each counseling component. By adhering to these guidelines, the therapeutic techniques are captured in the annotations. This ensures that the resulting summaries are concise yet rich in informative content for the specific component.

Psychotherapy Elements

Within the realm of mental health therapy sessions, distinct counseling components play a pivotal role in facilitating successful interventions. The MentalCLOUDS data set serves as a valuable resource, furnishing meticulously labeled utterances that encompass 3 fine-grained components []:

SH: this facet encapsulates utterances teeming with insightful information crucial for the therapist’s nuanced assessment of the patient’s situation.PD: patients entering counseling sessions often bring intricate thoughts to the fore. Therapists, in turn, endeavor to establish therapeutic connections, creating a conducive environment for patients to articulate and unravel their thoughts. Such utterances by the therapist that encourage patients to reveal their concerns lie in this category.Reflecting: therapists use concise utterances, allowing ample space for patients to share their life stories and events. Encouraging patient narratives, therapists may also use hypothetical scenarios to evaluate actions and enhance understanding.

When crafting a summary for a dialogue D, aligned with a specific counseling component C, our primary focus rests on utterances marked with C within D in the MEMO data set. Consequently, we derived 3 distinct counseling summaries for each counseling component within a single session to create the MentalCLOUDS data set. shows the data statistics, where a balanced distribution of patient and therapist utterances within the data set is evident. Notably, PD emerges as the prevailing label in the data set, highlighting patients’ inclination to discuss ancillary topics rather than focusing solely on their mental health concerns when prompted to share their experiences. By contrast, reflecting emerges as the least tagged label in this comprehensive analysis.

Benchmarking

In recent years, the spotlight on LLMs has intensified, captivated by their extraordinary performance across diverse applications. From classification tasks such as emotion recognition [] to generative problems such as response generation [], these models have proven their versatility. In this paper, our focus is directed toward evaluating their capability in the domain of counseling summarization, specifically using MentalCLOUDS. In our comprehensive analysis, we leveraged 11 state-of-the-art pretrained LLM architectures, including a mix of general-purpose and specialized models. These models are considered to carefully assess their performance concerning each facet of the counseling-component summaries. We explain each of these systems in .

This is to highlight that all baseline models are transformer based, and computational complexities associated with the transformer-based architectures while being trained or fine-tuned involve a computational cost of O(L × N2 × D), where N represents the sequence length, D denotes the hidden dimension, and L signifies the number of transform layers. As we maintain a constant number of layers across all training steps, the computational complexity simplifies to O(N2 × D).

Moreover, our selection of benchmarked models comprises both small language models (SLMs), such as BART, T5, the GPT family, Phi-2, and MentalBART, as well as LLMs such as Flan-T5, Mistral, Llama-2, and MentalLlama. SLMs typically operate within the parameter range of 300 million to 2 billion, whereas LLMs are characterized by a higher parameter count, ranging from 7 billion to 9 billion (as kept in our study). In addition to analyzing the models’ complexity for a better understanding of their applicability, another crucial metric to consider is the model’s runtime. LLMs tend to consume more runtime due to their larger parameter count, while SLMs run quickly but may compromise accuracy. A comprehensive analysis of the models’ runtime is provided in .

Textbox 1. Description of the 11 models evaluated.Bidirectional and Auto-Regressive Transformer (BART) []: this is a sequence-to-sequence model designed for various natural language processing (NLP) tasks, including text summarization. It uses a transformer architecture with an encoder-decoder structure. It incorporates a denoising autoencoder objective during pretraining, reconstructing the original input from corrupted versions. We used the pretrained base version of the model in our experiments.Text-To-Text Transfer Transformer (T5) []: this is a versatile transformer-based model consisting of an encoder-decoder framework with bidirectional transformers. It reframes all NLP tasks as text-to-text tasks, providing a unified approach. T5 learns representations by denoising corrupted input-output pairs. Its encoder captures contextual information while the decoder generates target sequences. The pretrained base version of T5 was used in our experiments.GPT-2 []: this is a transformer-based language model that comprises a stack of identical layers, each with a multihead self-attention mechanism and position-wise fully connected feed-forward networks. GPT-2 follows an autoregressive training approach, predicting the next token in a sequence given its context.GPT-Neo []: trained from the Pile data set [], GPT-Neo exhibits a similar architecture as GPT-2 except for a few modifications, such as the use of local attention in every other layer with a window size of 256 tokens. In addition, GPT-Neo houses a combination of linear attention [], a mixture of experts [], and axial positional embedding [] to achieve performance comparable to that of larger LLMs, such as GPT-3.GPT-J []: this is a transformer model trained using the methodology proposed by Wang []. It is a GPT-2–like causal language model trained on the Pile data set.FLAN-T5 []: this is the instruction fine-tuned version of the T5 model with a particular focus on scaling the number of tasks, scaling the model size, and fine-tuning on chain-of-thought data.Mistral []: this is a decoder-based LLM with a sliding-window attention mechanism, where it is trained with an 8k context length and fixed cache size, with a theoretical attention span of 128K tokens. Faster inference and lower cache are ensured by using grouped query attention [].MentalBART []: this is an open-source LLM constructed for interpretable mental health analysis with instruction-following capability. The model is fine-tuned using the Interpretable Mental Health Instruction (IMHI) data set [] and is expected to make complex mental health analyses for various mental health conditions.MentalLlama []: similar to MentalBART, MentalLlama is the counterpart of the Llama architecture but is trained on the IMHI data set. The model is fine-tuned to integrate the capability of an LLM with domain knowledge in mental health.Llama-2 []: this is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine tuning [] and reinforcement learning with human feedback [] to align with human preferences for helpfulness and safety. The model is trained exclusively on publicly available data sets.Phi-2: this is an extension of Phi-1 []. Phi-1 is a transformer-based frugal LLM with the largest variant having 1.3 billion parameters. It is trained on textbook-quality data. It emphasizes the quality of the data to compensate for its relatively small number of parameters. Phi-2 has 2.7 billion parameters, which shows comparable performances with other larger LLMs despite its smaller size.Table 2. Average runtime of models fine-tuned on Mental Health Counseling-Component–Guided Dialogue Summaries (MentalCLOUDS) for summarization tasks across 3 psychotherapy elements: symptom and history, patient discovery, and reflecting.ModelVariant or parametersTime (min)GPUaBARTbBase2.27A100T5cBase18.81A100MentalBARTBase5.94A100Flan-T5Base16.56A100GPT-2124 million6.30A100GPT-Neo1.3 billion32.98A100GPT-J6 billion44.69A100MentalLlama748.27RTX A6000+RTX A5000Mistral7 billion43.86RTX A6000+RTX A5000Phi-22.7 billion9.38A100

aGPU: graphics processing unit.

bBART: Bidirectional and Auto-Regressive Transformer.

cT5: Text-To-Text Transfer Transformer.

Ethical Considerations

The study did not involve any human subject research; hence, we did not seek ethics approval.

Results

We undertook a comprehensive evaluation of the generated session summaries across various architectures, using a dual approach of quantitative and qualitative assessments.

Quantitative AssessmentOverview

This section reports the aspect-based (psychotherapy element–based) summarization results based on the automatic evaluation scores. Given the generative nature of the task, we used standard summarization evaluation metrics such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1, ROUGE-2, ROUGE-L, and BERT Score (BERTScore) along with their corresponding precision, recall, and F1-score values. As the F1-score accounts for precision and recall, we compared the performance of the LLMs based on F1-score values unless stated otherwise. ROUGE [] assesses the overlap of n-grams (sequences of n consecutive words) between the generated summary and reference summaries. Specifically, this metric measures the number of overlapping units such as n-grams, word sequences, and word pairs in the generated summary evaluated against the gold summary typically created by humans. ROUGE favors the candidate summary with more overlaps with reference summaries. This effectively gives more weight to matching n-grams occurring in multiple reference summaries. This work reports the unigram and bigram ROUGE (namely ROUGE-1 and ROUGE-2) and ROUGE-L evaluations. ROUGE-L takes into account the longest co-occurring n-gram between the candidate and reference summaries. BERTScore [] is harnessed to gauge the semantic coherence between the generated summaries and their ground truths. Notably, in the context of counseling summaries, which are inherently tied to a domain-specific conversation, we embarked on a meticulous qualitative examination of the generated summaries for individual counseling components.

SH Summarization

reports the automatic evaluation scores of the LLMs on the summarization task for the SH psychotherapy element. MentalLlama outperformed the other LLMs across all automatic evaluation metrics. For the ROUGE-1 metric, MentalLlama achieved an F1-score of 30.86, followed by MentalBART with an F1-score of 28.00. In terms of the ROUGE-2 metric, Mistral was comparable to MentalLlama with a difference of just 0.90 in the F1-score values. Similarly, for the ROUGE-L metric, Mistral was preceded by MentalLlama by a difference of 2.93 in the F1-score values.

Table 3. Results obtained on Mental Health Counseling-Component–Guided Dialogue Summaries (MentalCLOUDS) for the summarization task on the symptom and history psychotherapy element.ModelROUGEa-1ROUGE-2ROUGE-LBERTScoreb
PrecisionRecallF1-scorePrecisionRecallF1-scorePrecisionRecallF1-scorePrecisionRecallF1-scoreBARTc12.9128.8416.261.885.072.4710.2123.9713.1985.8185.8185.81T5d22.1619.8119.742.181.781.8516.1214.5114.3685.3885.3885.38MentalBART30.3129.0228.006.065.295.4620.8520.3419.4088.3488.3488.34Flan-T521.4533.15e24.803.846.084.5417.1526.5319.7686.9486.9486.94GPT-26.5914.628.911.062.341.425.1211.376.9383.6583.6583.65GPT-Neo9.9719.9113.011.012.301.387.8915.9110.3383.1283.1283.12GPT-J13.2229.9917.883.377.964.5910.7124.3414.4786.2886.2886.28MentalLlama33.0332.7930.868.666.507.2827.7327.3029.5589.4090.9990.99Mistral29.0726.5625.417.035.207.1925.4525.6126.6283.4285.9683.05Llama-228.4924.1723.476.404.686.6322.723.0423.6682.8683.8081.62Phi-221.2310.4213.811.891.431.7814.569.1911.2684.2582.0083.11

aROUGE: Recall-Oriented Understudy for Gisting Evaluation.

bBERTScore: Bidirectional Encoder Representations from Transformers Score.

cBART: Bidirectional and Auto-Regressive Transformer.

dT5: Text-To-Text Transfer Transformer.

eThe best results are italicized.

PD Summarization

The experimental results presented in focus on the summarization task for the PD psychotherapy element. Considering the ROUGE-1 metric, MentalLlama demonstrated superior performance compared to the other LLMs. MentalLlama achieved an F1-score of 30.95, followed by MentalBART (with an F1-score of 29.94). For the ROUGE-2 metric, GPT-J outperformed the other models, followed by MentalLlama. In addition, in terms of the ROUGE-L metric, the top 2 models with the highest F1-score values were F1 score models were MentalLlama and Mistral. Finally, MentalBART superseded the other models with an F1-score of 88.61 with respect to the BERTScore metric. Overall, the scores indicate that LLMs such as MentalLlama and MentalBART, which were pretrained on the mental domain data, show consistent superiority. Notably, the base Mistral model also performed comparably to, and sometimes better than, the models trained on the mental health domain data.

Table 4. Results obtained on Mental Health Counseling-Component–Guided Dialogue Summaries (MentalCLOUDS) for the summarization task on the patient discovery psychotherapy element.ModelROUGEa-1ROUGE-2ROUGE-LBERTScoreb
PrecisionRecallF1-scorePrecisionRecallF1-scorePrecisionRecallF1-scorePrecisionRecallF1-scoreBARTc20.8243.2426.725.9712.937.7416.3834.8221.1487.3587.3587.35T5d9.4347.2915.343.0316.905.018.3942.5813.6784.7784.7784.77MentalBART33.5129.9429.949.367.948.0623.3921.4421.1088.61e88.6188.61Flan-T521.0835.6124.444.818.895.6316.1328.2918.9486.5286.5286.52GPT-213.6636.2419.574.0811.275.9410.9329.4215.7085.2185.2185.21GPT-Neo12.9629.9317.832.325.443.229.8423.1013.6082.7282.7282.72GPT-J19.7853.3328.8512.6835.7118.7116.1243.3323.4986.4386.4386.43MentalLlama24.5643.8430.959.5526.0112.7923.7738.9829.1784.6388.9586.68Mistral22.8439.0227.548.7825.7911.3521.9035.9824.0286.6287.2884.49Llama-220.2234.726.18.4121.1310.3914.7321.4417.7978.8188.0681.48Phi-218.729.2312.455.614.444.9613.948.7310.9884.2582.0080.05

aROUGE: Recall-Oriented Understudy for Gisting Evaluation.

bBERTScore: Bidirectional Encoder Representations from Transformers Score.

cBART: Bidirectional and Auto-Regressive Transformer.

dT5: Text-To-Text Transfer Transformer.

eThe best results are italicized.

Reflecting

reports the automatic evaluation scores on the summarization task for the reflecting psychotherapy element. In terms of the ROUGE-1 metric, MentalLlama and Mistral were the best 2 models, with F1-score values of 39.52 and 38.33, respectively. Similarly, MentalLlama demonstrated its superiority over the other LLMs in terms of the ROUGE-2, ROUGE-L and BERTScore metrics. Moreover, the scores of the summarization tasks for this psychotherapy element were analogous to those of the previous 2 summarization tasks, namely SH and PD, wherein the mental health–specific LLMs exhibited their superiority over the other LLMs.

Table 5. Results obtained on Mental Health Counseling-Component–Guided Dialogue Summaries (MentalCLOUDS) for the summarization task on the reflecting psychotherapy element.ModelROUGEa-1ROUGE-2ROUGE-LBERTScorebPrecisionRecallF1-scorePrecisionRecallF1-scorePrecisionRecallF1-scorePrecisionRecallF1-scoreBARTc17.0123.0418.082.874.253.2212.6817.7913.6685.2685.2685.26T5d34.1319.3224.317.213.975.0422.9512.8216.2184.9284.9284.92MentalBART34.99e36.5434.4610.2410.6610.0724.5225.8024.2588.7088.7088.70Flan-T525.1041.4030.157.1912.038.6418.5231.0022.3687.4187.4187.41GPT-22.847.544.080.140.330.202.356.343.3982.6682.6682.66GPT-Neo1.143.971.740.000.000.001.143.971.7480.8880.8880.88GPT-J17.6038.3323.715.0713.047.1314.9832.8520.1886.9486.9486.94MentalLlama31.6854.7639.528.2611.9910.1727.1337.5926.5684.7786.9287.43Mistral29.1549.2838.338.4211.878.3424.4134.2023.4478.8379.9784.81Llama-226.9343.8131.226.109.238.2416.8220.6716.2178.9386.0582.19Phi-210.615.216.910.940.710.897.284.605.5386.9482.1784.49

aROUGE: Recall-Oriented Understudy for Gisting Evaluation.

bBERTScore: Bidirectional Encoder Representations from Transformers Score.

cBART: Bidirectional and Auto-Regressive Transformer.

dT5: Text-To-Text Transfer Transformer.

eThe best results are italicized.

Qualitative Assessment by ExpertsExpert Panel Composition and Evaluation Framework

To conduct a comprehensive expert assessment, 5 health care professionals were employed to assess the clinical appropriateness of the summaries produced by the LLMs based on the evaluation framework postulated by Sekhon et al []. Of the 5 health care professionals, 2 (40%) were clinical psychologists and 3 (60%) were psychiatrists and medical practitioners; 4 (80%) were male and 1 (20%) was female; and their ages ranged from 40 to 55 years. Furthermore, each health care professional possessed more than a decade of therapeutic experience.

The evaluation framework encompasses 6 crucial parameters: affective attitude, burden, ethicality, coherence, opportunity costs, and perceived effectiveness. The experts evaluated each session summary against these acceptability parameters, assigning continuous ratings on a scale ranging from 0 to 2, where a higher rating signified enhanced acceptability. In addition, we incorporated a new parameter: the extent of hallucination. It is categorical: 0=extensive hallucination observed, 1=minimal hallucination observed, and 2=no hallucination observed. These evaluative dimensions are defined in .

reports the clinical experts’ scores averaged over their ratings. The clinical acceptability framework [] involves 6 parameters: affective attitude, burden, ethicality, coherence, opportunity costs, and perceived effectiveness (refer to for more details). We selected the 3 best LLMs (MentalLlama, Mistral, and MentalBART) for the expert evaluation based on the automatic evaluation results. Notably, Mistral outperformed the other 2 LLMs across all metrics, although the other 2 LLMs were fine-tuned on mental health domain data. Overall, all raters were more aligned in rating t

View original article

JMIR MENTAL HEALTH

Share Bookmark

0 0 0 0 0 0 0

More from this channel

Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: Benchmark Study

Comments (0)