Abstract:
|
Biosynthetic gene clusters (BGCs) in bacterial genomes code for important small molecules and secondary metabolites. Using validated BGCs, similarities among protein family (Pfam) domains, and Pfam functions, we develop a deep learning method, BIGclass, for detecting BGCs and their classes. We show that BIGclass achieves lower false positive rates in BGC identification and an improved ability to extrapolate to and identify novel BGCs compared with existing methods. We apply BIGclass to 5,666 RefSeq bacterial genomes and predict a total of 170,685 BGCs. Each genome has, on average, 30.1 predicted BGCs, with counts ranging from 0 to 243. We summarize the predicted BGCs, their functional classes, and their distributions across bacterial phyla. Applications of the BGCs in disease studies will be presented.
|