Abstract:
|
Deep learning has revealed a surprising statistical phenomenon: the possibility of 'benign' overfitting. Experiments have shown that trained neural networks are capable of simultaneously (1) overfitting datasets that have substantial amounts of random label noise and (2) generalizing well to unseen data, a behavior that is inconsistent with the familiar bias-variance tradeoff of classical statistics. In this talk we investigate this phenomenon theoretically for two-layer neural networks trained by gradient descent on the cross-entropy loss. We assume the data comes from well-separated class-conditional distributions and allow for a constant fraction of the training labels to be corrupted by an adversary. We show that in this setting, neural networks indeed exhibit benign overfitting: despite the non-convex nature of the optimization problem, the empirical risk is driven to zero, overfitting the noisy labels; and, as opposed to classical intuition, the networks simultaneously generalize near-optimally. In contrast to previous works on benign overfitting that require linear or kernel-based predictors, our analysis holds in a setting where both the model and the learning dynamics are fundamentally nonlinear.
|