Safety Evaluation Metrics of LLM-Powered Mental Health Chatbot
Poster Number: P96
Presentation Time: 05:00 PM - 06:30 PM
Abstract Keywords: Large Language Models (LLMs), Evaluation, Standards
Primary Track: Applications
Programmatic Theme: Clinical Research Informatics
Our study focuses on developing comprehensive evaluation metrics for assessing the clinical safety of Large Language Models (LLMs) in mental health care, addressing a gap in current methodologies, which often prioritize technical performance over safety and ethical considerations. We propose a collaborative framework involving mental health and LLM experts to enhance the reliability of health chatbots.
A two-phase evaluation process was established: first, 100 benchmark questions reflecting real-life clinical scenarios were created; second, experts reviewed chatbot responses using five guideline questions covering adherence to practice guidelines, health risk management, consistency in critical situations, assessment of resource provision, and empowerment of users in managing their health. Applying this method to ChatGPT-3.5's responses across various clinical scenarios yielded an overall average score of 7.2 from mental health experts.
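The expert-review phase described above can be sketched as a simple scoring aggregation: each expert rates a response against the five guideline areas, and ratings are averaged per response and then across the benchmark set. The function names, the dictionary layout, and the rating scale (the abstract reports an average of 7.2 but does not state the scale) are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean

# The five safety criteria from the expert-review phase of the framework.
CRITERIA = [
    "adherence_to_practice_guidelines",
    "health_risk_management",
    "consistency_in_critical_situations",
    "resource_provision",
    "user_empowerment",
]

def score_response(expert_ratings):
    """Average one response's ratings across experts and criteria.

    `expert_ratings` maps expert IDs to {criterion: rating} dicts.
    The numeric rating scale is an assumption; the abstract does not specify it.
    """
    per_expert = [mean(r[c] for c in CRITERIA) for r in expert_ratings.values()]
    return mean(per_expert)

def overall_score(scored_responses):
    """Overall average across all benchmark responses."""
    return mean(score_response(r) for r in scored_responses)

# Hypothetical ratings from two experts for one benchmark question:
ratings = {
    "expert_1": {c: 8 for c in CRITERIA},
    "expert_2": {c: 7 for c in CRITERIA},
}
print(score_response(ratings))  # 7.5
```

This treats all five criteria as equally weighted; a real rubric might weight risk-critical criteria (e.g. health risk management) more heavily.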
The findings highlight the need for a standardized approach to chatbot evaluation. Our framework lays the groundwork for future work in developing metrics for accuracy, empathy, and privacy, aiming for the responsible integration of chatbots into healthcare and building trust among users and professionals.
Speaker(s):
Jung In Park, PhD, RN, FAMIA
UC Irvine
Author(s):
Jung In Park, PhD, RN, FAMIA - UC Irvine;
Category
Poster - Regular
Description
Date: Tuesday (11/12)
Time: 05:00 PM to 06:30 PM
Room: Grand Ballroom (Posters)