Evaluating the performance of large language models on the ASPS In-Service Examination: A comparative analysis with resident norms
Document Type
Article
Publication Date
11-19-2025
Publication Title
Journal of Plastic, Reconstructive & Aesthetic Surgery
Abstract
The emergence of large language models (LLMs) has raised critical questions about their potential roles in surgical education. This study evaluated the accuracy and comparative performance of three leading LLMs, ChatGPT 4.0, DeepSeek V3, and Gemini 2.5, on the American Society of Plastic Surgeons (ASPS) Plastic Surgery In-Service Training Examination (PSITE) across a 20-year period. ChatGPT achieved the highest overall accuracy (75.0%), followed closely by DeepSeek (74.8%) and Gemini (74.5%), with no significant differences between models (p > 0.05). When benchmarked against normative data, DeepSeek reached the highest percentile ranks (81st among residents, 89th among practitioners), followed by ChatGPT (78th and 84th) and Gemini (72nd and 90th), with no significant differences in rankings across LLMs (p > 0.05). In conclusion, modern LLMs demonstrate consistent, high-level performance on the PSITE, frequently exceeding the median performance of plastic surgery residents and practitioners.
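For illustration only, a minimal Python sketch of the kind of model-versus-model accuracy comparison the abstract reports, using a chi-square test of independence. The per-model item count n is hypothetical (the record does not state the denominator), and this is not the authors' analysis:

    # Hedged sketch: compare three models' accuracies with a chi-square test.
    # n is an assumed item count; the abstract does not report it.
    from scipy.stats import chi2_contingency

    n = 1000  # hypothetical number of PSITE items per model
    accuracies = {"ChatGPT 4.0": 0.750, "DeepSeek V3": 0.748, "Gemini 2.5": 0.745}

    # Build a models x (correct, incorrect) contingency table.
    table = [[round(a * n), n - round(a * n)] for a in accuracies.values()]

    chi2, p, dof, _ = chi2_contingency(table)
    print(f"chi2={chi2:.3f}, dof={dof}, p={p:.3f}")  # p well above 0.05, consistent with the abstract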
First Page
164
Last Page
167
PubMed ID
41308375
Volume
113
Rights
© 2025 British Association of Plastic, Reconstructive and Aesthetic Surgeons. Published by Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
Recommended Citation
Shekouhi, Ramin; Holohan, Mary M.; Mirzalieva, Oygul; Byrd, Brandon; Guidry, Mairin F.; Palines, Patrick A.; and Chim, Harvey, "Evaluating the performance of large language models on the ASPS In-Service Examination: A comparative analysis with resident norms" (2025). School of Medicine Faculty Publications. 4371.
https://digitalscholar.lsuhsc.edu/som_facpubs/4371
DOI
10.1016/j.bjps.2025.11.031