Collecting and annotating data is typically the most expensive part of developing supervised Deep Learning (DL) algorithms. It is also widely accepted that adding more training data improves DL model performance. However, it is usually unclear how much data, or which data, is required to reach a given performance goal.
This paper presents a method that fits power law and inverted power law models to data points consisting of a given performance metric (e.g., accuracy, error, F1 score) measured on training sets of increasing size. The fitted models are then used to 1) predict model performance for sample sizes larger than those available and 2) predict the training set size needed to reach a desired model performance. Important aspects of this process are the choice of the training set sizes and the sampling strategy within each set. The method is applied to two DL case studies: 1) an open-source image classification algorithm and 2) an object detection algorithm used in automated driving vehicles (AVs).
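As an illustrative sketch only (not the paper's actual implementation), the fit-and-invert procedure could look like the following in Python. The training set sizes, accuracy values, and the saturating power law form `c - a * n**(-b)` are assumptions chosen for the example; the fitting uses `scipy.optimize.curve_fit`.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical learning-curve data: training set sizes and measured accuracies.
sizes = np.array([500, 1000, 2000, 4000, 8000], dtype=float)
acc = np.array([0.71, 0.76, 0.80, 0.83, 0.855])

def power_law(n, a, b, c):
    """Saturating power law: accuracy(n) = c - a * n**(-b), with asymptote c."""
    return c - a * n ** (-b)

# Fit the power law to the measured (size, accuracy) points.
(a, b, c), _ = curve_fit(power_law, sizes, acc, p0=(1.0, 0.5, 0.9), maxfev=10000)

# 1) Extrapolate performance to a sample size larger than any available.
predicted = power_law(50_000, a, b, c)

# 2) Invert the fitted law to estimate the size needed for a target accuracy.
#    Solving c - a * n**(-b) = target gives n = (a / (c - target))**(1 / b);
#    this is only valid when the target lies below the fitted asymptote c.
target = 0.90
needed = (a / (c - target)) ** (1.0 / b)

print(f"predicted accuracy at n=50000: {predicted:.3f}")
print(f"estimated n for accuracy {target}: {needed:.0f}")
```

In this sketch the inverted model is obtained analytically from the fitted forward model; fitting a separate inverted power law directly to (performance, size) points would be an alternative design choice.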