Active learning using a self-correcting neural network (ALSCN)

Authors: Velibor Ilić, Jovan Tadić

Abstract

Data labeling is a major obstacle in the development of new models: the performance of a machine learning model depends directly on the quality of the dataset used to train it, and labeling requires substantial manual effort. Labeling the entire dataset is not always necessary, because not every item in an image dataset contributes equally to training. Active learning, or guided labeling, is one attempt to automate and speed up labeling as much as possible. In this study we present a novel active learning algorithm (ALSCN) that combines two networks: a convolutional neural network and a self-correcting neural network (SCN). The convolutional network is trained using only manually labeled data and then predicts labels for the unlabeled items. The SCN is trained on all available items, some manually labeled and the rest labeled automatically by the convolutional network. After training, the SCN predicts new labels for all available items, and these are compared with the labels used for training. Items whose labels differ are selected for manual labeling and added to the set of manually labeled items. The convolutional network is then retrained on the extended dataset, and the steps are repeated. Our experiments show that a network trained on items selected by the proposed method outperforms a network trained on the same number of items selected at random from the available items. Items are selected from the complete datasets over several iterations and used to train the models. The accuracy of models trained on the selected items matched or exceeded that of models trained on the entire dataset, which indicates how much manual labeling effort can be saved.
The efficiency of the presented algorithm is tested on three datasets (MNIST, Fashion MNIST and CIFAR-10). The final results show that manual labeling is required for only 6.11% (3,667/60,000), 23.92% (14,353/60,000) and 59.4% (29,704/50,000) of the items for MNIST, Fashion MNIST and CIFAR-10, respectively.
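
The iterative selection procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: both the convolutional network and the SCN are replaced here by a hypothetical nearest-centroid classifier so that the example stays short and runs with NumPy alone. The names `fit_centroids`, `predict`, `alscn_loop`, and the parameters `n_init` and `n_iters` are assumptions made for illustration, as is the choice to request manual labels only for items that are not yet manually labeled. The `oracle` array stands in for the human annotator.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    """Stand-in 'training': one mean vector per class (not a CNN or the paper's SCN)."""
    centroids = np.full((n_classes, X.shape[1]), np.inf)  # unseen classes are never predicted
    for c in range(n_classes):
        mask = y == c
        if mask.any():
            centroids[c] = X[mask].mean(axis=0)
    return centroids

def predict(centroids, X):
    """Assign each item to the class with the nearest centroid."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def alscn_loop(X, oracle, n_classes, n_init=10, n_iters=5, seed=0):
    """Return a boolean mask of the items that ended up manually labeled."""
    rng = np.random.default_rng(seed)
    n = len(X)
    labeled = np.zeros(n, dtype=bool)
    labeled[rng.choice(n, size=n_init, replace=False)] = True  # initial manual labels
    y_manual = np.full(n, -1)
    y_manual[labeled] = oracle[labeled]

    for _ in range(n_iters):
        # Step 1: train the first model on manually labeled items only.
        first = fit_centroids(X[labeled], y_manual[labeled], n_classes)
        # Step 2: keep manual labels, auto-label everything else with the first model.
        y_train = np.where(labeled, y_manual, predict(first, X))
        # Step 3: train the second model on all items with the mixed labels.
        second = fit_centroids(X, y_train, n_classes)
        # Step 4: re-predict all items and find labels that changed.
        y_new = predict(second, X)
        disagree = (y_new != y_train) & ~labeled  # assumption: skip already-labeled items
        if not disagree.any():
            break
        # Step 5: send the disagreeing items to the annotator and extend the labeled set.
        labeled |= disagree
        y_manual[disagree] = oracle[disagree]
    return labeled
```

Calling `alscn_loop(X, oracle, n_classes)` on a feature matrix `X` and a ground-truth label array `oracle` returns the mask of manually labeled items, so `labeled.mean()` gives the fraction of the dataset that required manual labeling under this toy setup. The figures reported in the abstract come from the authors' networks and are not expected to be reproduced by this stand-in.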

Keywords: Active learning, Machine learning, Convolutional neural networks (CNN), Dataset labeling

Paper URL: https://doi.org/10.1007/s10489-021-02515-y