A case-study on naïve labelling for the nearest mean and the linear discriminant classifiers

Abstract

The abundance of unlabelled data alongside limited labelled data has provoked significant interest in semi-supervised learning methods. “Naïve labelling” refers to the following simple strategy for using unlabelled data in on-line classification. A new data point is first labelled by the current classifier and then added to the training set together with the assigned label. The classifier is updated before seeing the subsequent data point. Although the danger of a run-away classifier is obvious, versions of naïve labelling are pervasive in on-line adaptive learning. We study the asymptotic behaviour of naïve labelling in the case of two Gaussian classes and one variable. The analysis shows that if the classifier model correctly assumes the underlying distribution of the problem, naïve labelling will drive the parameters of the classifier towards their optimal values. However, if the model is not guessed correctly, the benefits are outweighed by the instability of the labelling strategy (run-away behaviour of the classifier). The results are based on exact calculations of the point of convergence, simulations, and experiments with 25 real data sets. The findings in our study are consistent with concerns about general use of unlabelled data, flagged up in the recent literature.
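The naïve labelling loop described in the abstract can be illustrated with a minimal sketch for a nearest mean classifier on one variable. This is not the authors' code: the function name `naive_labelling_nmc`, the running-sum update, the tie-breaking rule, and the demo parameters (class means at -2 and +2, unit variance, 1000 points) are all assumptions chosen for illustration.

```python
import random


def naive_labelling_nmc(labelled, stream):
    """Nearest mean classifier updated by naive labelling (1-D, two classes).

    labelled: list of (x, y) pairs with y in {0, 1}; at least one per class.
    stream:   iterable of unlabelled values x.
    Returns the two class means after the whole stream has been processed.
    (Hypothetical helper written for illustration only.)
    """
    sums = [0.0, 0.0]
    counts = [0, 0]
    for x, y in labelled:
        sums[y] += x
        counts[y] += 1
    if 0 in counts:
        raise ValueError("need at least one labelled point per class")

    for x in stream:
        # Current classifier: the two class means.
        means = [sums[0] / counts[0], sums[1] / counts[1]]
        # Label the new point with the nearest mean (ties go to class 0,
        # an arbitrary assumption).
        y_hat = 0 if abs(x - means[0]) <= abs(x - means[1]) else 1
        # Add the point to the training set with its assigned label,
        # which updates the classifier before the next point arrives.
        sums[y_hat] += x
        counts[y_hat] += 1

    return [sums[0] / counts[0], sums[1] / counts[1]]


if __name__ == "__main__":
    # Demo under assumed settings: two equiprobable Gaussian classes.
    rng = random.Random(0)
    stream = [rng.gauss(-2.0 if rng.random() < 0.5 else 2.0, 1.0)
              for _ in range(1000)]
    print(naive_labelling_nmc([(-2.0, 0), (2.0, 1)], stream))
```

Because every assigned label is treated as if it were true, an early misclassification shifts a class mean and can bias later labels, which is the run-away risk the abstract refers to.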

Keywords: Semi-supervised learning, Unlabelled data, On-line classifiers, Naïve labelling

Article history: Received 2 August 2007, Revised 28 November 2007, Accepted 25 March 2008, Available online 7 April 2008.

Paper URL: https://doi.org/10.1016/j.patcog.2008.03.028