Abstract:
|
Motivated by the open problem of validating protein identities in label-free shotgun proteomics workflows, we present a testing procedure that validates class/protein labels using the available measurements across instances/peptides. More generally, we propose a non-parametric procedure that, under minimal distributional assumptions, identifies instances deemed to be outliers, based on a distance (or quasi-distance) measure, relative to the subset of instances assigned to the same class. The test is shown to simultaneously control the Type I and Type II error probabilities whilst also controlling the overall error probability of the repeated testing invoked when validating the initial class labels. The theoretical results are supplemented with a simulation study and an application to a proteomics data set, which illustrate the applicability and viability of the method. Even with up to 25% of instances mislabeled, our testing procedure maintains high specificity and greatly reduces the proportion of mislabeled instances.
|