Abstract:
|
As of January 2022, the GISAID database contains more than one million SARS-CoV-2 genomes, including more than ten thousand nucleotide sequences of the recently discovered omicron variant. These SARS-CoV-2 strains have been collected from patients around the world since the beginning of the pandemic. We investigate whether nucleotide sequences cluster according to their variant (or other characteristics, such as clade) by applying an unsupervised cluster analysis to the SARS-CoV-2 genomes. To this end, we first measure the similarity of all pairs of genomes with the Jaccard index and then apply principal component analysis. We show that nucleotide sequences indeed cluster according to certain characteristics, such as the strain. More importantly, we show that nucleotide sequences of the omicron variant appear as outliers in clusters of sequences stemming from variants identified earlier in the pandemic. This finding suggests that outlier detection might be a useful tool for identifying emerging variants in real time as the pandemic progresses.
|
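The two-step analysis summarized in the abstract (pairwise Jaccard similarity followed by principal component analysis) can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: each genome is encoded as a binary vector marking which mutations it carries relative to a reference, and the principal components are computed from the Jaccard similarity matrix itself. The function names `jaccard_matrix` and `pca_scores` and the toy data are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def jaccard_matrix(X):
    """Pairwise Jaccard similarity between the rows of a binary matrix X."""
    X = np.asarray(X, dtype=bool).astype(np.int64)
    intersection = X @ X.T                      # shared mutations per pair
    sizes = X.sum(axis=1)                       # mutations per genome
    union = sizes[:, None] + sizes[None, :] - intersection
    # Convention chosen for this sketch: two empty sets have similarity 1.
    return np.where(union == 0, 1.0, intersection / np.maximum(union, 1))


def pca_scores(S, n_components=2):
    """Project the rows of S onto its leading principal components (via SVD)."""
    centered = S - S.mean(axis=0, keepdims=True)
    U, s, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_components] * s[:n_components]


# Hypothetical toy data: 6 genomes x 8 mutation sites (1 = mutation present).
# Rows 0-2 share one set of mutations, rows 3-5 share a disjoint set.
genomes = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
])

S = jaccard_matrix(genomes)
scores = pca_scores(S)
print(np.round(S, 2))       # pairwise similarity matrix
print(np.round(scores, 2))  # coordinates on the first two principal components
```

In this toy example the two groups of genomes fall on opposite sides of the first principal component. In such a projection, a sequence lying far from the bulk of its cluster would be a candidate outlier, which is the kind of signal the abstract proposes for flagging emerging variants.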