Text Report Analysis to Identify Opportunities for Optimizing Target Selection for Chest Radiograph Artificial Intelligence Models

Publication Title

Journal of Imaging Informatics in Medicine


Our goal was to analyze radiology report text for chest radiographs (CXRs) to identify the imaging findings that have the greatest impact on report length and complexity. Identifying these findings can highlight opportunities for designing CXR artificial intelligence (AI) systems that increase radiologist efficiency. We retrospectively analyzed text from 210,025 MIMIC-CXR reports and 168,949 reports from our local institution collected from 2019 to 2022. Fifty-nine categories of imaging finding keywords were extracted from the reports using natural language processing (NLP), and their impact on report length was assessed using linear regression with and without LASSO regularization. Regression was also used to assess the impact of additional factors contributing to report length, such as the signing radiologist and the use of terms of perception. When modeling CXR report word counts with regression using only imaging finding keyword features, the mean coefficient of determination (R²) was 0.469 ± 0.001 for local reports and 0.354 ± 0.002 for MIMIC-CXR. Mean R² was significantly lower when considering only the use of terms of perception: 0.067 ± 0.001 for local reports and 0.086 ± 0.002 for MIMIC-CXR. For a combined model of the local report data accounting for the signing radiologist, imaging finding keywords, and terms of perception, the mean R² was 0.570 ± 0.002. With LASSO, the highest-value coefficients pertained to endotracheal tubes and pleural drains for local data and to masses, nodules, and cavitary and cystic lesions for MIMIC-CXR. NLP and regression analysis of radiology report text can highlight imaging targets for AI models that offer opportunities to bolster radiologist efficiency.
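The pipeline described in the abstract (keyword features extracted from report text, then regression of word count on those features with and without an L1 penalty) can be illustrated with a minimal sketch. This is not the authors' implementation: the three keyword patterns, the function names, and the plain coordinate-descent solver are all assumptions chosen for illustration, and the study's 59 actual categories are not listed in the abstract. A real analysis would more likely use an established library such as scikit-learn.

```python
import re

# Hypothetical keyword categories for illustration only; the study used 59
# categories that are not enumerated in the abstract.
KEYWORDS = {
    "endotracheal_tube": r"\bendotracheal tube\b",
    "pleural_drain": r"\bchest tube\b|\bpleural drain\b",
    "nodule": r"\bnodules?\b",
}


def extract_features(report):
    """Return one binary indicator per keyword category (1.0 if mentioned)."""
    text = report.lower()
    return [1.0 if re.search(pattern, text) else 0.0
            for pattern in KEYWORDS.values()]


def word_count(report):
    """Report length measured as whitespace-separated tokens."""
    return len(report.split())


def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)*||y - b0 - Xw||^2 + lam*||w||_1.

    With lam = 0 this reduces to ordinary least squares, which corresponds
    to regression "without LASSO regularization".
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b0 = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = b0 + sum(
                    X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
        # The intercept is not penalized: set it to the mean residual.
        b0 = sum(y[i] - sum(X[i][k] * w[k] for k in range(p))
                 for i in range(n)) / n
    return b0, w


def r_squared(X, y, b0, w):
    """Coefficient of determination of the fitted linear model."""
    mean_y = sum(y) / len(y)
    ss_res = sum((y[i] - b0 - sum(a * b for a, b in zip(X[i], w))) ** 2
                 for i in range(len(y)))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot
```

In this sketch, each fitted coefficient estimates how many words a finding category adds to a report on average. Increasing `lam` drives the coefficients of less informative categories to exactly zero, which is why the categories with the largest surviving coefficients are the natural ones to highlight.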
