Medical Ethics Assessment of Large Language Models
Evaluating ethical reasoning capabilities of LLMs in clinical contexts
Overview
Assessed the ethical reasoning capabilities of large language models (LLMs) in clinical contexts, an underexplored dimension of Responsible AI, revealing critical limitations in their reliability and reasoning depth on ethically sensitive healthcare tasks.
Course: Deep Natural Language Understanding, University of Colorado Boulder
Status: Completed
Research Objective
As LLMs are increasingly applied in clinical contexts for documentation and decision support, their ability to understand and reason about medical ethics remains an open question. This study systematically evaluates whether current models can navigate ethical trade-offs that arise in day-to-day medical practice.
Methodology
Model Evaluation
Evaluated four representative LLMs spanning different model types and training focuses (a hypothetical checkpoint mapping follows the list):
- Mistral-7B: General-purpose baseline model
- BioMistral-7B: Domain-adapted version fine-tuned on biomedical data
- DeepSeek-R1-Distill-Qwen-1.5B: Small open-source reasoning model distilled from DeepSeek-R1
- ChatGPT-4o-mini: State-of-the-art proprietary model
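A hypothetical mapping from the evaluated models to public checkpoints and API identifiers is sketched below; the exact revisions and serving setup are assumptions, not details from the study.

```python
# Hypothetical model registry; checkpoint revisions are assumptions.
MODELS = {
    "Mistral-7B": "mistralai/Mistral-7B-Instruct-v0.2",        # general-purpose baseline
    "BioMistral-7B": "BioMistral/BioMistral-7B",                # biomedical fine-tune of Mistral
    "DeepSeek-R1-Distill-Qwen-1.5B": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "ChatGPT-4o-mini": "gpt-4o-mini",                           # proprietary, accessed via API
}
```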
Key Comparisons:
- Domain adaptation effects (Mistral vs. BioMistral)
- Open-source vs. proprietary model capabilities
- Answer accuracy vs. reasoning quality
Evaluation Framework
Question Set: A curated set of 100 multiple-choice questions spanning high-stakes ethical categories (an illustrative record layout follows the list):
- Informed consent
- Surrogate decision-making
- Disclosure and confidentiality
- Reproductive rights
- Public health obligations
- Professional impairment
- End-of-life care
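For illustration, one benchmark item could be stored as a record like the one below; the field names and example content are hypothetical, not the project's actual schema.

```python
# Hypothetical layout for one benchmark question (illustrative fields only).
question_record = {
    "id": 17,
    "category": "end-of-life care",
    "question": "A terminally ill patient with decision-making capacity refuses further treatment...",
    "options": {"A": "Continue treatment", "B": "Honor the refusal",
                "C": "Defer to the family", "D": "Seek a court order"},
    "gold_answer": "B",
    "gold_rationale": "Respecting the informed refusal of a capacitated patient upholds autonomy.",
}
```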
Prompting Strategies:
- Zero-shot: Direct answer selection
- Few-shot: 3-5 examples with gold-standard rationales
- RAG-enhanced: Retrieval-Augmented Generation using AMA Code of Medical Ethics
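A minimal sketch of how the three prompting conditions could be assembled is shown below; the template wording is an assumption rather than the exact prompts used in the study.

```python
# Hypothetical prompt builder covering the zero-shot, few-shot, and RAG-enhanced conditions.
def build_prompt(question: str, options: str,
                 examples: list[str] | None = None,
                 guidance: str | None = None) -> str:
    parts = []
    if guidance:      # RAG-enhanced: prepend retrieved AMA Code sections
        parts.append("Relevant guidance from the AMA Code of Medical Ethics:\n" + guidance)
    if examples:      # few-shot: 3-5 worked examples with gold-standard rationales
        parts.extend(examples)
    # zero-shot falls through to the bare question
    parts.append(f"{question}\n{options}\nAnswer with a single letter (A-D).")
    return "\n\n".join(parts)
```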
Retrieval-Augmented Generation Pipeline
Implemented RAG using the AMA Code of Medical Ethics as the external knowledge source (a minimal retrieval sketch follows the list):
- Embedding Model: PubMedBERT MS MARCO for biomedical text retrieval
- Retrieval Process: Top-10 similar chunks using cosine similarity
- Re-ranking: MiniLM cross-encoder for semantic relevance
- Integration: Top-ranked sections prepended to model prompts
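A minimal sketch of this retrieve-then-rerank step, assuming the sentence-transformers library and the checkpoint names below; the exact models and chunking used in the project may differ.

```python
# Sketch of two-stage retrieval over AMA Code chunks; checkpoint names are assumptions.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

embedder = SentenceTransformer("pritamdeka/S-PubMedBert-MS-MARCO")   # biomedical bi-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")      # MiniLM cross-encoder

def retrieve_guidance(question: str, chunks: list[str], top_k: int = 10, keep: int = 3) -> str:
    """Return the most relevant AMA Code passages for one question."""
    q_emb = embedder.encode(question, convert_to_tensor=True)
    c_emb = embedder.encode(chunks, convert_to_tensor=True)
    # Stage 1: top-10 candidate chunks by cosine similarity
    hits = util.semantic_search(q_emb, c_emb, top_k=top_k)[0]
    candidates = [chunks[h["corpus_id"]] for h in hits]
    # Stage 2: re-rank the candidates with the cross-encoder
    scores = reranker.predict([(question, c) for c in candidates])
    ranked = [c for _, c in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]
    # The top-ranked sections are prepended to the model prompt
    return "\n\n".join(ranked[:keep])
```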
Behavioral Probing Techniques
Confidence Probing (Open-Source Models)
- Extracted token-level logits and computed softmax probabilities
- Measured confidence in each answer choice, including incorrect selections
- Identified cases of confidently wrong predictions
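A minimal sketch of this probing step for one open-source model, assuming a Hugging Face causal LM; the checkpoint and the handling of answer-letter tokens are assumptions.

```python
# Sketch of confidence probing: softmax over the next-token logits of the answer letters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-Instruct-v0.2"                 # assumed checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

def answer_confidence(prompt: str, choices=("A", "B", "C", "D")) -> dict[str, float]:
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]         # logits for the next token
    letter_ids = [tok(c, add_special_tokens=False).input_ids[-1] for c in choices]
    probs = torch.softmax(next_logits[letter_ids], dim=-1)  # confidence restricted to A-D
    return dict(zip(choices, probs.tolist()))

# A prediction is flagged "confidently wrong" when the top-probability letter differs
# from the gold answer while its probability exceeds a chosen threshold.
```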
Reasoning Probing (ChatGPT & DeepSeek)
- Collected natural language rationales via “letter + reason” prompts
- Two-step verification:
  - Answer Match: Consistency between the stated letter choice and the generated answer
  - Reasoning Match: Alignment with gold-standard ethical principles and clinical logic
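A minimal sketch of the "letter + reason" collection and an automated answer-match check, assuming the OpenAI Python client; the prompt wording and regex are assumptions, and the reasoning-match step is left as a judgment against the gold rationales.

```python
# Sketch of the "letter + reason" probe and answer-match check; details are assumed.
import re
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

def probe_rationale(question: str, options: str, gold_letter: str) -> dict:
    prompt = (f"{question}\n{options}\n\n"
              "Answer with the letter of the best option, then briefly explain your reasoning.")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = resp.choices[0].message.content
    m = re.search(r"\b([A-D])\b", text)                      # first standalone letter = stated choice
    stated = m.group(1) if m else None
    return {
        "rationale": text,
        "answer_match": stated == gold_letter,               # step 1, read here as stated letter vs. reference answer
        # step 2 (reasoning match) compares the rationale against gold-standard
        # ethical principles, e.g. via manual or rubric-based review.
    }
```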
Key Findings
Performance Results
Model Accuracy:
- BioMistral outperformed Mistral, demonstrating the benefit of domain adaptation
- ChatGPT-4o-mini achieved the highest overall accuracy and reasoning quality
- Even the best model failed on nearly 1 in 5 cases, often with high confidence
RAG Impact:
- Minimal performance gains from external ethics guidelines
- Retrieved content was either already known to the models or too general to be useful
- Suggests that added context alone is insufficient without the ability to apply it
Few-Shot Learning:
- Mixed results: improved Mistral's accuracy
- Slightly reduced BioMistral's performance (possible interference with its domain-specific training)
- Limited overall effectiveness
Critical Limitations Identified
Reliability Issues:
- Models made serious errors on ethically complex scenarios
- High-confidence incorrect answers pose safety risks
- Failures particularly common with conflicting ethical principles
Reasoning Depth:
- Partial knowledge of medical ethics principles
- Inability to reliably apply principles to novel situations
- Surface-level pattern matching vs. true ethical understanding
Contextual Application:
- Models struggle to navigate trade-offs between competing ethical values
- Difficulty with nuanced, real-world clinical scenarios
- External knowledge integration insufficient without reasoning capability
Research Questions Addressed
- How well do current LLMs identify the most ethical course of action in clinical settings?
- Partial competence with significant failures, especially on complex cases
- Does domain adaptation or external knowledge retrieval improve ethical performance?
- Domain adaptation helps modestly; RAG provides minimal benefit
- Do model confidence scores and natural language rationales provide reliable signals of ethical soundness?
- No—models can be confidently wrong; rationales often miss critical reasoning
Impact and Implications
Societal Impact
Early Warning Signal: Systematic evidence of LLM failures in medical-ethical reasoning enables the development of safer clinical decision-support tools
Clinical Safety: Highlights the dangers of over-reliance on LLMs in high-stakes settings, prompting:
- Rigorous human oversight requirements
- Additional safety checks before deployment
- Domain-specific alignment investments
Broader Implications
- Current State: LLMs exhibit partial understanding of medical ethics but lack reliability for clinical use
- Deployment Risk: Ethically flawed suggestions could jeopardize patient safety and erode public trust
- Research Needs: Better training methods, stronger grounding in ethical principles, and more thoughtful evaluation required
Technical Contributions
- Curated Evaluation Benchmark: 100-question dataset covering diverse high-stakes ethical scenarios
- Dual Probing Framework: Combined confidence analysis and reasoning evaluation for comprehensive assessment
- RAG Pipeline for Ethics: Retrieval system using professional medical ethics guidance
- Systematic Comparison: Domain adaptation, few-shot learning, and external knowledge effects
Conclusion
This study demonstrates that current language models show partial understanding of medical ethics principles but are not yet reliable for clinical applications. Simply adding more context through retrieval or examples is insufficient—models need fundamental improvements in how they reason about and apply ethical principles in complex situations.
The findings serve as a critical foundation for future work on trustworthy medical AI, emphasizing the need for human oversight, specialized training, and robust safety mechanisms before LLMs can be responsibly deployed in healthcare settings.
This research contributes to the responsible development of AI systems in healthcare by identifying specific limitations and risks in current LLM approaches to medical ethics.