Abstract:
|
Population stratification remains a challenging problem in genome-wide association studies: samples of genomic data are often stratified and contaminated by outliers. Benford's law, also known as the Newcomb-Benford law or the first-digit law, is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. It has been applied to fraud detection in several types of datasets (e.g., tax records and election surveys). When a dataset is free from error or fabrication, its first digits are expected to follow the Benford distribution; when a dataset has been artificially modified or is contaminated by outliers, the digit distribution deviates from the Benford distribution. This study proposes an outlier detection method for genotype data based on Benford's law. We test the accuracy of the new method by applying it to datasets with genuine or simulated outliers, and we compare its performance against existing approaches (e.g., PCA-based methods). We believe the new approach is a promising contribution that helps detect population stratification more accurately.
|
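As a rough illustration of the first-digit check described in the abstract, the following Python sketch compares observed leading-digit frequencies with the Benford probabilities, P(d) = log10(1 + 1/d) for d = 1, ..., 9, using a Pearson chi-square statistic. It is a minimal sketch under stated assumptions, not the method proposed in the study: the abstract does not say how genotype data are converted to numbers or which test statistic is used, so the function names (`leading_digit`, `benford_expected`, `benford_chi_square`) and the choice of chi-square are hypothetical.

```python
import math
from collections import Counter


def benford_expected():
    """Benford probabilities P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """Return the first significant digit of a nonzero number."""
    x = abs(x)
    if x == 0:
        raise ValueError("zero has no leading digit")
    # In scientific notation the first character is the leading digit.
    return int(f"{x:.15e}"[0])


def benford_chi_square(values):
    """Pearson chi-square statistic of first digits against Benford.

    Assumption: `values` is any iterable of numbers; zeros are skipped.
    How genotype data would be mapped to such numbers is not specified
    in the abstract and is left open here.
    """
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    if n == 0:
        raise ValueError("no nonzero values to test")
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )
```

A larger statistic indicates a stronger departure from the Benford distribution. With eight degrees of freedom, the conventional 5% critical value is about 15.51. For example, `benford_chi_square(2 ** k for k in range(1000))` stays well below that value, whereas a sample whose first digits are uniformly distributed exceeds it by a wide margin.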