School of Public Health Faculty Publications

Class Imbalance In Out-of-distribution Datasets: Improving The Robustness Of The TextCNN For The Classification Of Rare Cancer Types

Kevin De Angeli, Oak Ridge National Laboratory
Shang Gao, Oak Ridge National Laboratory
Ioana Danciu, Oak Ridge National Laboratory
Eric B. Durbin, University of Kentucky College of Medicine
Xiao Cheng Wu, LSU Health Sciences Center - New Orleans
Antoinette Stroup, Rutgers Cancer Institute of New Jersey
Jennifer Doherty, Huntsman Cancer Institute
Stephen Schwartz, Fred Hutchinson Cancer Research Center
Charles Wiggins, The University of New Mexico
Mark Damesyn, California Department of Public Health
Linda Coyle, Information Management Services, Inc.
Lynne Penberthy, National Cancer Institute (NCI)
Georgia D. Tourassi, Oak Ridge National Laboratory
Hong Jun Yoon, Oak Ridge National Laboratory

Document Type

Article

Publication Date

11-22-2021

Publication Title

Journal of Biomedical Informatics

Abstract

In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.
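The abstract contrasts macro F1 (where rare cancer types weigh as much as common ones) with micro F1 (dominated by the top classes). The following is a minimal sketch of that distinction, not the authors' evaluation code: the two labels `"common"` and `"rare"` and the toy predictions are hypothetical and chosen only to make the arithmetic visible.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # predicted class p gains a false positive
            fn[t] += 1          # true class t gains a false negative

    def f1(tp_, fp_, fn_):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so each class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Hypothetical toy data: 8 reports of a common class, 2 of a rare class.
y_true = ["common"] * 8 + ["rare"] * 2
# The model gets every common report right but misses one of the two rare ones.
y_pred = ["common"] * 8 + ["rare", "common"]

micro_f1, macro_f1 = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro_f1:.3f}, macro F1 = {macro_f1:.3f}")
```

In this toy case a single missed rare report barely moves micro F1 but pulls macro F1 down noticeably, which is why a method can lead on one metric and trail on the other.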

PubMed ID

34823030

Volume

125

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.

Recommended Citation

De Angeli, Kevin; Gao, Shang; Danciu, Ioana; Durbin, Eric B.; Wu, Xiao Cheng; Stroup, Antoinette; Doherty, Jennifer; Schwartz, Stephen; Wiggins, Charles; Damesyn, Mark; Coyle, Linda; Penberthy, Lynne; Tourassi, Georgia D.; and Yoon, Hong Jun, "Class Imbalance In Out-of-distribution Datasets: Improving The Robustness Of The TextCNN For The Classification Of Rare Cancer Types" (2021). School of Public Health Faculty Publications. 270.
https://digitalscholar.lsuhsc.edu/soph_facpubs/270
10.1016/j.jbi.2021.103957

Included in

Epidemiology Commons, Oncology Commons

DOI

10.1016/j.jbi.2021.103957
