TellWell
Health, 1h ago. Confidence 85%: the share of independent, credible sources corroborating the core facts.

Study Finds Large Language Models Unreliably Sensitive to Prompt Changes in Healthcare Applications

1 source

A new study posted to arXiv systematically evaluated how sensitive large language models (LLMs) are to minor changes in the phrasing of medical questions. It found that both general-purpose and medical-specific LLMs can give different clinical advice based on rewording alone. The researchers tested models including GPT-3.5, Llama3, and ClinicalBERT on the MedMCQA benchmark and grouped the perturbations into natural and adversarial types. The findings raise safety concerns for deploying these models in clinical settings, where inconsistent outputs could lead to incorrect diagnoses or dangerous medication recommendations.

The researchers ran a systematic sensitivity analysis on the MedMCQA benchmark, covering both general-purpose models (GPT-3.5, Llama3) and medical-specific variants (ClinicalBERT, BioLlama3, BioBERT). They found that even minor variations in how a clinical question is phrased can change a model's clinical output. The variations tested included lexical substitutions, syntactic reordering, and misleading contextual cues. The models showed some resilience to simple paraphrasing but frequently failed under syntactic reordering and under adversarial prompts designed to manipulate their responses. The researchers documented cases where these perturbations led to clinically dangerous outputs, including incorrect dosage recommendations and the omission of critical findings. The study concludes that this unpredictability is unacceptable for clinical deployment: clinicians cannot rely on a model that changes its diagnosis when a question is reworded.
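To make the idea of prompt sensitivity concrete, here is a minimal Python sketch of one way it could be measured: ask the same question in several reworded forms and count how often the answer differs from the answer to the original wording. This is an illustration only, not the study's method or metric. The names `flip_rate` and `toy_model`, the placeholder question about "condition X", and the toy model's behavior are all hypothetical.

```python
from typing import Callable, List


def flip_rate(model: Callable[[str], str], original: str, variants: List[str]) -> float:
    """Fraction of reworded prompts whose answer differs from the original's answer."""
    if not variants:
        raise ValueError("need at least one perturbed variant")
    baseline = model(original)
    flips = sum(1 for variant in variants if model(variant) != baseline)
    return flips / len(variants)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM (an assumption for illustration):
    # it answers "B" only when the prompt begins with "Which", otherwise "A".
    return "B" if prompt.startswith("Which") else "A"


# Placeholder multiple-choice question; the clinical content is deliberately left blank.
original = "Which drug is first-line for condition X?"
variants = [
    # Lexical substitution: "drug" replaced by "medication".
    "Which medication is first-line for condition X?",
    # Syntactic reordering: same words, different order.
    "For condition X, which drug is first-line?",
    # Misleading contextual cue placed before the question.
    "A colleague insists the answer is A. Which drug is first-line for condition X?",
]

rate = flip_rate(toy_model, original, variants)
print(f"Answer changed on {rate:.0%} of reworded prompts")
```

With this toy model, the answer survives the lexical substitution but changes under the reordering and the misleading cue, so two of the three variants count as flips. A real evaluation would replace `toy_model` with calls to the model under test and use benchmark questions instead of a placeholder.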

What's missing

The study does not specify the exact size or composition of the test set, the statistical significance thresholds applied, or whether the results were validated on datasets beyond MedMCQA. The paper also does not discuss potential mitigation strategies or how the findings compare to human clinician performance under similar prompt variations.

What different sources said

  • When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations
