Abstract:
|
Identification of rare variant associations is crucial to understanding the genetics of complex traits and diseases. Essential to this process is the evaluation of novel methods on simulated data that mirror real data in both the distribution of rare variants and haplotype structure. Additionally, using real variant annotation enables in silico comparison of methods that focus on putative causal variants, such as rare variant association tests and polygenic scoring methods. Existing simulation methods either cannot employ real variant annotation or severely under- or over-estimate the number of very rare variants, which limits how well simulation results generalize to real studies. We present RAREsim, a flexible and accurate rare variant simulation algorithm. Using parameters and haplotypes derived from real sequencing data, RAREsim efficiently simulates the expected variant distribution and enables the use of real variant annotations. We demonstrate RAREsim’s utility across various genetic regions, sample sizes, ancestries, and variant classes. RAREsim is provided as an R package, accompanied by a vignette, simulation scripts, and reference data for easy implementation.
|