Abstract:
|
Over the past few decades, high-throughput biological data has become a useful tool for understanding intricate biological systems. However, the resulting data is extremely high-dimensional, making it difficult to detect true associations amid random noise. Several data mining tools, such as support vector machines (SVM) and Random Forest, have emerged to handle such analyses. These tools focus primarily on prediction, and they give inconsistent results when used for variable selection. The Irreproducible Discovery Rate (IDR) has been proposed as a method to better identify important variables in high-dimensional biological data. We explore its use on large, sparse, high-dimensional datasets to increase the accuracy and consistency of variable importance measures used in data mining.
|