466 – Biodata Methods
Statistical Modeling of Genomic Words and Motifs
Guozhu Zhang
North Carolina State University
Stephen Lee
University of Idaho
The arrangement of the four nucleotides A, C, G, and T along the genome is known to be non-random. A vast amount of information is built into the complex arrangement and composition of genomic nucleotides. The genome can be viewed as a book of nucleotide text that holds instructions at the cellular level, and it is decoded as a continuous stream of nucleotide letters as one reads the genomic text. We approach the reading of genomic text by segmentation: dividing the continuous stream into chunks according to statistical measures of homogeneity. The goal is to segment the genome into the most probable dictionary of motifs, or words. Our segmentation method defines words as units that are more homogeneous within their boundaries than outside them. The core idea of this paper is a method for setting word boundaries. We applied the method to compare the yeast and worm genomes, to distinguish ordered and disordered protein sequences, and to characterize different English texts.
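As an illustration of segmentation by homogeneity, the following minimal sketch places boundaries where the letter composition on the two sides of a cut differs most. The abstract does not specify the homogeneity measure, so this sketch substitutes a generic one as an assumption: recursive binary splitting scored by the Jensen-Shannon divergence between the left and right compositions. The function names, the threshold of 0.3 bits, and the minimum segment length of 4 are hypothetical choices made for the example, not values taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(counts, n):
    """Shannon entropy (bits) of a letter-count table with n letters in total."""
    return -sum((c / n) * log2(c / n) for c in counts.values() if c > 0)


def best_split(seq, min_len=4):
    """Return (cut, gain) for the cut that maximizes the Jensen-Shannon
    divergence between seq[:cut] and seq[cut:], or (None, 0.0) if no cut
    gives a positive gain. Both sides must have at least min_len letters."""
    n = len(seq)
    h_total = entropy(Counter(seq), n)
    left, right = Counter(), Counter(seq)
    best_cut, best_gain = None, 0.0
    for i in range(1, n):
        # Move one letter from the right side to the left side.
        left[seq[i - 1]] += 1
        right[seq[i - 1]] -= 1
        if i < min_len or n - i < min_len:
            continue
        gain = (h_total
                - (i / n) * entropy(left, i)
                - ((n - i) / n) * entropy(right, n - i))
        if gain > best_gain:
            best_cut, best_gain = i, gain
    return best_cut, best_gain


def segment(seq, threshold=0.3, min_len=4, offset=0):
    """Recursively split seq; return the sorted list of boundary positions."""
    cut, gain = best_split(seq, min_len)
    if cut is None or gain < threshold:
        return []  # Segment is homogeneous enough: treat it as one unit.
    return (segment(seq[:cut], threshold, min_len, offset)
            + [offset + cut]
            + segment(seq[cut:], threshold, min_len, offset + cut))


if __name__ == "__main__":
    # Toy sequence: a run of A followed by a run of G.
    print(segment("A" * 20 + "G" * 20))
```

Because the functions operate on any string, the same sketch applies unchanged to protein sequences or English text. A fixed threshold is the simplest stopping rule; a significance test on the divergence would be a natural replacement.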
