Pointly-supervised scene parsing with uncertainty mixture

摘要

Pointly-supervised learning is an important topic for scene parsing, as dense annotation is extremely expensive and hard to scale. The state-of-the-art method harvests pseudo labels by applying thresholds upon softmax outputs (logits). There are two issues with this practice: (1) Softmax output does not necessarily reflect the confidence of the network output. (2) There is no principled way to decide on the optimal threshold. Tuning thresholds can be time-consuming for deep neural networks. Our method, by contrast, builds upon uncertainty measures instead of logits and is free of threshold tuning. We motivate the method with a large-scale analysis of the distribution of uncertainty measures, using strong models and challenging databases. This analysis leads to the discovery of a statistical phenomenon called uncertainty mixture. Specifically speaking, for each independent category, the distribution of uncertainty measures for unlabeled points is a mixture of two components (certain v.s. uncertain samples). The phenomenon of uncertainty mixture is surprisingly ubiquitous in real-world datasets like PascalContext and ADE20k. Inspired by this discovery, we propose to decompose the distribution of uncertainty measures with a Gamma mixture model, leading to a principled method to harvest reliable pseudo labels. Beyond that, we assume the uncertainty measures for labeled points are always drawn from the certain component. This amounts to a regularized Gamma mixture model. We provide a thorough theoretical analysis of this model, showing that it can be solved with an EM-style algorithm with convergence guarantee. Our method is also empirically successful. On PascalContext and ADE20k, we achieve clear margins over the baseline, notably with no threshold tuning in the pseudo label generation procedure. On the absolute scale, since our method collaborates well with strong baselines, we reach new state-of-the-art performance on both datasets.