School of Public Health Faculty Publications

A Keyword-enhanced Approach To Handle Class Imbalance In Clinical Text Classification

Andrew E. Blanchard, Oak Ridge National Laboratory
Shang Gao, Oak Ridge National Laboratory
Hong Jun Yoon, Oak Ridge National Laboratory
J. Blair Christian, Oak Ridge National Laboratory
Eric B. Durbin, University of Kentucky College of Medicine
Xiao Cheng Wu, LSU Health Sciences Center - New Orleans
Antoinette Stroup, Rutgers Cancer Institute of New Jersey
Jennifer Doherty, Huntsman Cancer Institute
Stephen M. Schwartz, Fred Hutchinson Cancer Research Center
Charles Wiggins, The University of New Mexico
Linda Coyle, Information Management Services, Inc.
Lynne Penberthy, National Cancer Institute (NCI)
Georgia D. Tourassi, Oak Ridge National Laboratory

Document Type

Article

Publication Date

1-12-2022

Publication Title

IEEE Journal of Biomedical and Health Informatics

Abstract

Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominate. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e., keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used as additional data during each batch of model training, resulting in a training loss that has contributions from both raw data and keywords. We evaluate our approach on classification of cancer pathology reports, which shows a substantial increase in model performance for rare classes. Furthermore, we analyze the impact of keywords on model output probabilities for bigrams, providing a straightforward method to identify model difficulties for limited training data.
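
The abstract's idea of a training loss with contributions from both raw data and keywords can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the uniform sampling over classes, the `keyword_weight` parameter, and the placeholder class and keyword strings are all assumptions made for illustration. The abstract states only that keywords are used as additional data in each batch.

```python
import math
import random


def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class for each example."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def sample_keyword_batch(keywords_by_class, batch_size, rng):
    """Draw (keyword, class) pairs for one training batch.

    Assumption: classes are sampled uniformly so that rare classes are
    represented; the source does not specify the sampling scheme.
    """
    classes = sorted(keywords_by_class)
    batch = []
    for _ in range(batch_size):
        label = rng.choice(classes)
        batch.append((rng.choice(keywords_by_class[label]), label))
    return batch


def combined_loss(doc_probs, doc_labels, kw_probs, kw_labels, keyword_weight=1.0):
    """Loss on raw documents plus a (hypothetically weighted) loss on keywords."""
    doc_term = cross_entropy(doc_probs, doc_labels)
    kw_term = cross_entropy(kw_probs, kw_labels)
    return doc_term + keyword_weight * kw_term


# Toy usage with placeholder keywords and made-up model output probabilities.
keywords_by_class = {
    "common_class": ["keyword a", "keyword b"],
    "rare_class": ["keyword c"],
}
keyword_batch = sample_keyword_batch(keywords_by_class, 4, random.Random(0))

doc_probs = [[0.9, 0.1], [0.8, 0.2]]  # model outputs for two documents
doc_labels = [0, 0]                   # both belong to the common class
kw_probs = [[0.5, 0.5]]               # model output for one rare-class keyword
kw_labels = [1]
loss = combined_loss(doc_probs, doc_labels, kw_probs, kw_labels)
```

In this sketch the keyword term penalizes the model when it assigns low probability to the class a keyword is associated with, even if no training document in the batch belongs to that class.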

First Page

2796

Last Page

2803

PubMed ID

35020599

Volume

Issue

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Recommended Citation

Blanchard, Andrew E.; Gao, Shang; Yoon, Hong Jun; Christian, J. Blair; Durbin, Eric B.; Wu, Xiao Cheng; Stroup, Antoinette; Doherty, Jennifer; Schwartz, Stephen M.; Wiggins, Charles; Coyle, Linda; Penberthy, Lynne; and Tourassi, Georgia D., "A Keyword-enhanced Approach To Handle Class Imbalance In Clinical Text Classification" (2022). School of Public Health Faculty Publications. 131.
https://digitalscholar.lsuhsc.edu/soph_facpubs/131
10.1109/JBHI.2022.3141976

File Format

pdf

File Size

3273 KB

Included in

Clinical Epidemiology Commons, Clinical Trials Commons

DOI

10.1109/JBHI.2022.3141976
