Examination Date

5-9-2024

Degree

Dissertation

Degree Program

Biostatistics

Examination Committee

Zhide Fang, PhD; Donald Mercante, PhD; Hui-Yi Lin, PhD; Evrim Oral, PhD; Jian Li, PhD

Abstract

Bulk RNA sequencing (RNA-Seq) and single-cell RNA sequencing (scRNA-Seq) are commonly used to identify biomarker genes that are differentially expressed between study conditions. Apart from known biological conditions, both bulk RNA-Seq and scRNA-Seq datasets are affected by hidden factors that are not controlled during the experiment. If not properly addressed during differential expression (DE) analysis, such hidden factors may distort the true biological signal of interest and lead to spurious detection of marker genes. This problem intensifies further when known and hidden factors are correlated.

RNA-Seq datasets contain gene expression counts, which are often affected by sequencing process bias and are heteroscedastic. In Chapter 2, we propose a weighted least squares (WLS)-based framework called improved surrogate variable analysis for RNA-Seq (iSVAr) to adjust for hidden factors in the DE analysis of bulk RNA-Seq data, regardless of their correlation status with known factors. Additionally, we propose a method called the modified voom (MV) transformation within iSVAr to mitigate sequencing bias and heteroscedasticity and to enhance the power of DE analysis with mean-based hypothesis tests. In Chapter 3, we evaluate and compare the effectiveness of iSVAr with the widely used methods svaseq and RUVSeq using both simulated and real-world bulk RNA-Seq datasets.

In scRNA-Seq, where gene expression is measured at the single-cell level, it is often imperative to identify marker genes that are associated with a key hidden factor, such as unknown cell types. However, cell types may be correlated with other hidden factors, such as cell cycle phases. In this case, we aim to estimate hidden factors while maintaining their inherent correlation structure and their correlation with known factors. In Chapter 4, we propose a WLS-based framework called improved surrogate variable analysis for scRNA-Seq (iSVAscr) to estimate the hidden factors and identify marker genes in scRNA-Seq data. Next, we assess the effectiveness of iSVAscr in terms of accuracy, precision, and cluster purity, and compare it with existing methods using both simulated and real-world scRNA-Seq datasets. Chapter 5 provides discussions and conclusions on the studies presented in this dissertation, along with future work focusing on extending iSVAscr to more recent spatial transcriptomics data.

Mazumder signature page.pdf (71 kB)
Exam Committee signature page
