School of Public Health Faculty Publications

Optimal Vocabulary Selection Approaches For Privacy-preserving Deep NLP Model Training For Information Extraction And Cancer Epidemiology

Hong Jun Yoon, Oak Ridge National Laboratory
Christopher Stanley, Oak Ridge National Laboratory
J. Blair Christian, Oak Ridge National Laboratory
Hilda B. Klasky, Oak Ridge National Laboratory
Andrew E. Blanchard, Oak Ridge National Laboratory
Eric B. Durbin, University of Kentucky College of Medicine
Xiao Cheng Wu, LSU Health Sciences Center - New Orleans
Antoinette Stroup, Rutgers Cancer Institute of New Jersey
Jennifer Doherty, Huntsman Cancer Institute
Stephen M. Schwartz, Fred Hutchinson Cancer Research Center
Charles Wiggins, The University of New Mexico
Mark Damesyn, California Department of Public Health
Linda Coyle, Information Management Services, Inc.
Georgia D. Tourassi, Oak Ridge National Laboratory

Document Type

Article

Publication Date

2-14-2022

Publication Title

Cancer Biomarkers

Abstract

Background: With the growing use of artificial intelligence and machine learning techniques in biomedical informatics, the security and privacy of data and subject identities have become an important issue and an essential research topic. Without intentional safeguards, machine learning models may improve task performance by finding patterns and features that are associated with private personal information. Objective: Deep learning models for information extraction from medical textual content are exposed to private health information and personally identifiable information, so their privacy vulnerability needs to be quantified. This study aims to quantify the privacy vulnerability of deep learning models for natural language processing and to explore a proper way of securing patients' information to mitigate confidentiality breaches. Methods: The target model is the multitask convolutional neural network for information extraction from cancer pathology reports, trained on data from multiple state population-based cancer registries. This study proposes the following schemes for collecting vocabularies from the cancer pathology reports: (a) words appearing in multiple registries, and (b) words that have higher mutual information. We performed membership inference attacks on the models in high-performance computing environments. Results: The comparison outcomes suggest that the proposed vocabulary selection methods resulted in lower privacy vulnerability while maintaining the same level of clinical task performance.
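
The two vocabulary selection schemes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify tokenization, thresholds, or how mutual information is computed, so the whitespace tokenizer, the `min_registries` and `k` parameters, the function names, and the choice of mutual information between word presence and a single task label are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def vocab_multi_registry(docs_by_registry, min_registries=2):
    """Scheme (a): keep words seen in at least `min_registries` registries."""
    registry_count = Counter()
    for docs in docs_by_registry.values():
        seen = set()
        for doc in docs:
            seen.update(tokenize(doc))
        registry_count.update(seen)  # each registry counts a word once
    return {w for w, n in registry_count.items() if n >= min_registries}


def mutual_information(docs, labels):
    """MI (in nats) between binary word presence and the document label."""
    n = len(docs)
    label_count = Counter(labels)
    word_count = Counter()
    joint = Counter()
    for doc, y in zip(docs, labels):
        for w in set(tokenize(doc)):
            word_count[w] += 1
            joint[(w, y)] += 1
    scores = {}
    for w, n_w in word_count.items():
        mi = 0.0
        for y, n_y in label_count.items():
            n_wy = joint[(w, y)]
            # Two cells per label: word present, word absent.
            for n_xy, n_x in ((n_wy, n_w), (n_y - n_wy, n - n_w)):
                if n_xy > 0:
                    mi += (n_xy / n) * math.log(n_xy * n / (n_x * n_y))
        scores[w] = mi
    return scores


def vocab_top_mi(docs, labels, k):
    """Scheme (b): keep the `k` words with the highest mutual information."""
    scores = mutual_information(docs, labels)
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


# Toy data (invented): surnames appear in one registry only and are dropped.
registries = {
    "A": ["carcinoma of breast smith", "breast tumor"],
    "B": ["lung carcinoma jones", "breast carcinoma"],
}
shared_vocab = vocab_multi_registry(registries)

docs = ["breast carcinoma", "breast tumor", "lung carcinoma", "lung tumor"]
labels = ["breast", "breast", "lung", "lung"]
mi_scores = mutual_information(docs, labels)
```

In this toy example, words that occur in only one registry (such as a patient surname) fall out of the shared vocabulary, and words that carry no information about the label receive a mutual information of zero. Either filter reduces the number of rare, potentially identifying tokens the model can memorize, which is the intuition the abstract describes.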

First Page

185

Last Page

198

PubMed ID

35213361

Recommended Citation

Yoon, Hong Jun; Stanley, Christopher; Christian, J. Blair; Klasky, Hilda B.; Blanchard, Andrew E.; Durbin, Eric B.; Wu, Xiao Cheng; Stroup, Antoinette; Doherty, Jennifer; Schwartz, Stephen M.; Wiggins, Charles; Damesyn, Mark; Coyle, Linda; and Tourassi, Georgia D., "Optimal Vocabulary Selection Approaches For Privacy-preserving Deep NLP Model Training For Information Extraction And Cancer Epidemiology" (2022). School of Public Health Faculty Publications. 268.
https://digitalscholar.lsuhsc.edu/soph_facpubs/268
10.3233/CBM-210306

DOI

10.3233/CBM-210306
