Abstract:
|
This study investigates a proteomic data set collected from 216 individuals: 121 cancer patients and 95 healthy volunteers. For each individual, the raw data contain 368,749 spectral data points. We use a Dynamic Binning technique that merges adjacent spectral points so that similar compounds are assigned to the same bin without sacrificing peak resolution. The process reduced the raw data from 1.16 GB to 9.3 MB in 5,155 bins. We compare the effect of this technique with that of other types of binning.
Within each bin, one can compute the mean, maximum, standard deviation, moving average, and other statistics for use in predictive modeling. This study compares how well these statistics perform in predicting cancer status. Furthermore, the study investigates the effects of variable selection via the False Discovery Rate, as discussed in Efron (2008, 2010) and Benjamini and Hochberg (1995). In addition, we apply various techniques from Dudoit, Shaffer, and Boldrick (2003). We compare these results with the variables selected by Decision Tree, Stochastic Gradient, Regression, and Partial Least Squares.
|
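As an illustration of variable selection via the False Discovery Rate, the following is a minimal sketch of the step-up procedure of Benjamini and Hochberg (1995). The abstract does not state how the per-bin p-values are obtained or which FDR level is used, so the sketch assumes one p-value per bin from some test comparing the cancer and healthy groups, and a level of q = 0.05. The function name `benjamini_hochberg` and the sample p-values are hypothetical and chosen only for illustration; this is not necessarily the implementation used in the study.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, one per input p-value, that is True
    where the corresponding hypothesis is rejected (the bin is selected)
    while controlling the false discovery rate at level q.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest 1-based rank k such that p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per bin (assumed to come from a two-group test).
bin_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                0.060, 0.074, 0.205, 0.212, 0.216]
selected = benjamini_hochberg(bin_p_values, q=0.05)
selected_bins = [i for i, keep in enumerate(selected) if keep]
print(selected_bins)
```

Because the procedure is step-up, a bin whose own p-value exceeds its rank threshold can still be selected when a larger p-value further down the ordering meets its threshold. With thousands of bins, as in this study, that behavior makes the procedure considerably less conservative than a Bonferroni correction.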