Many modern neural networks are trained in an over-parameterized regime where the parameters of the model exceed the size of the training dataset. Training these models involve highly non-convex landscapes and it is not clear how methods such as (stochastic) gradient descent provably find globally optimal models. Furthermore, due to their over-parameterized nature these neural networks in principle have the capacity to (over)fit any set of labels including pure noise. Despite this high fitting capacity, somewhat paradoxically, neural networks models trained via first-order methods continue to predict well on yet unseen test data. In this talk I will discuss some results aimed at demystifying such phenomena by demonstrating that gradient methods enjoy a few intriguing properties: (1) when initialized at random the iterates converge at a geometric rate to a global optima, (2) among all global optima of the loss the iterates converge to one with a near minimal distance to the initial estimate, (3) are provably robust to noise/corruption/shuffling on a fraction of the labels with these algorithms only fitting to the correct labels and ignoring the corrupted labels.