Transcription factors (TFs) bind DNA and control gene expression. Identifying TF binding sites is the first step in finding mutations that disrupt gene regulation and promote disease. ChIP-seq is the most common method for identifying these sites, but performing it on patient samples is hampered by the limited amount of available material. Existing computational methods primarily predict binding in genomic regions that match known TF sequence preferences. However, most binding sites do not resemble known TF sequence motifs, and many TFs are not sequence-specific.
We developed Virtual ChIP-seq, which predicts binding of individual TFs in new cell types using a neural network that integrates ChIP-seq results from other cell types with chromatin accessibility data from the new cell type. Virtual ChIP-seq also uses learned associations between gene expression and TF binding at specific genomic regions. We train Virtual ChIP-seq on a concatenated matrix of genomic regions and predictive features from the training cell types and evaluate its performance on each validation cell type. Virtual ChIP-seq outperforms position weight matrix methods, predicting binding with a Matthews correlation coefficient (MCC) > 0.3 for 31 TFs.
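For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how MCC can be computed from binary per-region labels (bound or unbound) and binary predictions. The function name `mcc` and the toy label vectors are illustrative assumptions, not part of Virtual ChIP-seq; the convention of returning 0 when the denominator vanishes is also an assumption, although it is a common one.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for two equal-length binary sequences.

    Hypothetical helper for illustration only; not taken from Virtual ChIP-seq.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1          # bound region predicted as bound
        elif pred:
            fp += 1          # unbound region predicted as bound
        elif truth:
            fn += 1          # bound region predicted as unbound
        else:
            tn += 1          # unbound region predicted as unbound
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0 when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy example: four genomic regions, two truly bound, one of them missed.
labels = [1, 1, 0, 0]
predictions = [1, 0, 0, 0]
score = mcc(labels, predictions)
```

Because MCC uses all four cells of the confusion matrix, it remains informative when bound regions are a small fraction of the genome, where accuracy alone would be dominated by the unbound class.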