Abstract:
|
The outcomes of high-throughput biological experiments often contain many missing observations because signals fall below the detection level. For example, the majority of reported expression levels in single-cell RNA-seq are zeros. Existing methods for reproducibility assessment do not account for missing values, leading to biased results. In this paper, we study how the reproducibility of high-throughput experiments is affected by the choice of operational factors (e.g., platform or sequencing depth) when a large number of measurements are missing. Using a latent variable approach, we extend correspondence curve regression to incorporate missing values. Our approach estimates the independent effects of covariates on reproducibility and the amount of missing data. Using simulations, we show that our method is more accurate in detecting differences in reproducibility than existing measures of reproducibility. We illustrate the usefulness of our method using a study of HCT116 cells from scRNA-seq libraries made using microfluidic and tube-based methods. We also determine the cost-effective sequencing depth that is required to achieve sufficient reproducibility.
|