Classification performance is usually estimated by comparing a set of predicted outcomes to corresponding ground truth labels. However, acquiring the ground truth can be difficult, unethical, or impossible in many cases (e.g., medical applications). We derive lower bounds on the sensitivity and specificity of a new test (e.g., a biomarker or an AI classifier) based on a reference test with known performance and the predictions of the two tests on the same unlabeled data. We also propose hypothesis tests for comparing the performance of a new test with that of a reference test when ground truth labels are unavailable. Our methods are model-free and rely only on basic assumptions about the dependency between the outcomes of the two tests that can reasonably be expected to hold. We perform simulations as well as case studies on real data to demonstrate the performance of our methods and to compare them to alternative approaches. This methodology is potentially useful for assessing whether a new test meets a pre-specified performance goal, or is superior in performance to a reference test, when a dataset with ground truth is not available but a reference test with known performance exists.
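To illustrate the general idea of bounding a new test's performance without labels, the sketch below uses a crude, elementary bound that is NOT the paper's derivation: from the triangle inequality on error indicators, Se_new >= Se_ref - P(disagree | Y=1), and bounding the conditional disagreement by the observable overall disagreement rate d divided by an assumed known prevalence pi gives Se_new >= Se_ref - d/pi. All simulation parameters (prevalence, test accuracies, conditionally independent errors) are illustrative assumptions.

```python
import random

random.seed(0)

n = 100_000
prev = 0.3                    # assumed known prevalence (pi)
se_ref, sp_ref = 0.90, 0.85   # known reference-test performance
se_new, sp_new = 0.95, 0.80   # hidden truth for the new test (simulation only)

y, t_ref, t_new = [], [], []
for _ in range(n):
    yi = 1 if random.random() < prev else 0
    # simplifying assumption: each test errs independently given the true label
    if yi == 1:
        ri = 1 if random.random() < se_ref else 0
        ni = 1 if random.random() < se_new else 0
    else:
        ri = 0 if random.random() < sp_ref else 1
        ni = 0 if random.random() < sp_new else 1
    y.append(yi); t_ref.append(ri); t_new.append(ni)

# observable without labels: overall disagreement rate between the two tests
d = sum(r != t for r, t in zip(t_ref, t_new)) / n

# crude lower bound: Se_new >= Se_ref - d / prevalence
se_bound = se_ref - d / prev

# check against the (normally unobservable) empirical sensitivity of the new test
pos = [i for i in range(n) if y[i] == 1]
se_emp = sum(t_new[i] for i in pos) / len(pos)
print(f"lower bound = {se_bound:.3f}, empirical Se_new = {se_emp:.3f}")
assert se_bound <= se_emp  # the bound is valid, though loose
```

Under conditional independence of errors this elementary bound is valid but very loose; the appeal of the approach described in the abstract is that suitable dependency assumptions yield much sharper bounds.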