Mining valid scientific discoveries from genomic data is often hampered by technical artifacts and inherent biological heterogeneity. The former are usually termed “batch effects,” and the latter is often modeled by subtypes. However, little research has addressed the correction of batch effects in the presence of unknown subtypes. Here, we propose a novel model, BUS, that simultaneously corrects batch effects, clusters samples into subtypes, and identifies features that distinguish subtypes, while allowing the number of subtypes to vary from batch to batch and keeping the computational complexity of linear order. We prove the identifiability of BUS and provide study designs under which batch effects can be corrected.
When combining real datasets, the true subtype of each sample is unknown, which makes clustering performance difficult to evaluate. Fortunately, the GSE109059 paired microRNA datasets designed by Qin et al. (2018, Sci Data) assayed each sample twice, once in each of two batches, thus providing an unprecedented opportunity to evaluate the accuracy of both clustering and batch effects correction. The subtypes assigned by BUS are highly concordant for the same biological sample profiled in the two batches.
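
As an illustration of how concordance on such a paired design could be quantified, the following minimal Python sketch computes two simple measures from the subtype labels assigned to the same samples in two batches: the fraction of samples receiving the same label, and the adjusted Rand index, which does not depend on how the subtypes are numbered. This is not the evaluation procedure used in the study; the function names and the toy labels are hypothetical, and the simple agreement rate assumes that subtype labels are shared across batches.

```python
from collections import Counter
from math import comb


def paired_agreement(labels_a, labels_b):
    """Fraction of paired samples given the same subtype label in both batches.

    Assumes labels are directly comparable across batches (hypothetical helper).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both batches must contain the same paired samples")
    if not labels_a:
        raise ValueError("at least one paired sample is required")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples.

    Invariant to permutations of the label names (hypothetical helper).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both batches must contain the same paired samples")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: the labelings trivially agree

    # Count sample pairs that fall in the same cluster, per contingency cell
    # and per marginal cluster.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = same_a * same_b / total_pairs
    maximum = (same_a + same_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every sample in a single cluster
    return (same_both - expected) / (maximum - expected)


# Toy labels for six samples profiled in both batches (made-up values).
batch1_subtypes = [1, 1, 2, 2, 3, 3]
batch2_subtypes = [1, 1, 2, 2, 3, 2]

agreement = paired_agreement(batch1_subtypes, batch2_subtypes)
ari = adjusted_rand_index(batch1_subtypes, batch2_subtypes)
print(f"agreement = {agreement:.3f}, ARI = {ari:.3f}")
```

Values near 1 for both measures would indicate that the same biological sample is assigned to the same subtype regardless of the batch in which it was profiled.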