Evaluating Large Language Models on Aerospace Medicine Principles

Document Type

Article

Publication Date

4-28-2025

Publication Title

Wilderness & Environmental Medicine

Abstract

Introduction: Large language models (LLMs) hold immense potential to serve as clinical decision-support tools for Earth-independent medical operations. However, the generation of incorrect information may be misleading or even harmful when applied to care in this setting.

Method: To better understand this risk, this work tested two publicly available LLMs, ChatGPT-4 and Google Gemini Advanced (1.0 Ultra), as well as a custom Retrieval-Augmented Generation (RAG) LLM, on factual knowledge and clinical reasoning in accordance with published material in aerospace medicine. We also evaluated the consistency of the two public LLMs when answering self-generated board-style questions.

Results: When queried with 857 free-response questions from Aerospace Medicine Boards Questions and Answers, ChatGPT-4 had mean reader scores ranging from 4.23 to 5.00 (Likert scale of 1-5) across chapters, whereas Gemini Advanced and the RAG LLM scored 3.30 to 4.91 and 4.69 to 5.00, respectively. When queried with 20 multiple-choice aerospace medicine board questions provided by the American College of Preventive Medicine, ChatGPT-4 and Gemini Advanced responded correctly 70% and 55% of the time, respectively, while the RAG LLM answered 85% correctly. Despite this quantitative measure of high performance, the LLMs tested still exhibited gaps in factual knowledge that could potentially be harmful, a degree of clinical reasoning that may not pass the aerospace medicine board exam, and some inconsistency when answering self-generated questions.

Conclusion: There is considerable promise for LLM use in autonomous medical operations in spaceflight given the anticipated continued rapid pace of development, including advancements in model training, data quality, and fine-tuning methods.

PubMed ID

40289627
