Conditional information gain networks as sparse mixture of experts
Abstract
Deep neural network models owe their representational power and high classification performance to their large number of learnable parameters. Running deep neural network models in resource-constrained environments is therefore problematic. Models employing conditional computing aim to reduce the computational burden while retaining performance on par with more complex neural network models. This paper proposes a new model, Conditional Information Gain Networks as Sparse Mixture of Experts (sMoE-CIGN). A CIGN model is a neural tree that allows conditionally skipping parts of the tree based on routing mechanisms inserted into the architecture. These routing mechanisms are based on differentiable Information Gain objectives. CIGN groups semantically similar samples in the leaves, enabling simpler classifiers to focus on differentiating between similar classes. This lets the CIGN model attain high classification performance with lighter models. We further improve the basic CIGN model by proposing a sparse mixture of experts model for difficult-to-classify samples that may otherwise get routed to suboptimal branches. If a sample's routing confidence for a child node exceeds a specific threshold, the sample may be routed towards multiple child nodes, and the classification decision is then taken as a mixture of these expert decisions. We learn the optimal routing thresholds by Bayesian Optimization over a validation set, minimizing a weighted loss that combines the classification accuracy and the number of multiply-accumulate operations (MACs). We show the effectiveness of CIGN models enhanced with the Sparse Mixture of Experts approach through extensive tests on the MNIST, Fashion MNIST, CIFAR-100 and UCI-USPS datasets, as well as comparisons with methods from the literature. sMoE-CIGN models retain generalization performance on par with a thick unconditional model while keeping the computational burden at the level of a much thinner model.
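As an illustration only, the following minimal Python sketch shows how the threshold-based multi-path routing and the mixture of leaf decisions described above could be realized. It is not the authors' code; all names (`Node`, `route_and_mix`, `tau`, the toy tree) are hypothetical, and a single scalar threshold stands in for the per-node thresholds learned in the paper.

```python
# Hypothetical sketch of sMoE-CIGN-style routing: at each inner node,
# every child whose routing probability exceeds a threshold receives the
# sample, so hard samples can reach several leaves; leaf class posteriors
# are then mixed, weighted by the probability of the path taken.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class Node:
    def __init__(self, dim, n_classes=None, children=()):
        self.children = list(children)
        if self.children:                        # inner node: routing head
            self.W = rng.normal(size=(dim, len(self.children)))
        else:                                    # leaf: lightweight classifier
            self.W = rng.normal(size=(dim, n_classes))

def route_and_mix(x, node, tau, weight=1.0):
    """Return (path_weight, leaf_posterior) pairs collected for sample x."""
    if not node.children:                        # leaf reached: classify
        return [(weight, softmax(x @ node.W))]
    p = softmax(x @ node.W)                      # routing probabilities
    selected = [i for i, pi in enumerate(p) if pi >= tau]
    if not selected:                             # fall back to the best child
        selected = [int(np.argmax(p))]
    pairs = []
    for i in selected:                           # sparse multi-path routing
        pairs += route_and_mix(x, node.children[i], tau, weight * p[i])
    return pairs

def predict(x, root, tau):
    pairs = route_and_mix(x, root, tau)
    total = sum(w for w, _ in pairs)
    mixed = sum(w * q for w, q in pairs) / total # mixture of expert decisions
    return int(np.argmax(mixed))

# Toy depth-2 binary tree over 16-d features and 10 classes.
leaves = [Node(16, n_classes=10) for _ in range(4)]
inner = [Node(16, children=leaves[:2]), Node(16, children=leaves[2:])]
root = Node(16, children=inner)
print(predict(rng.normal(size=16), root, tau=0.3))  # lower tau => more experts
```

In the paper, the thresholds are tuned with Bayesian Optimization on a validation set under a weighted objective trading off classification accuracy against MACs; a plausible form is (1 − accuracy) + λ · MACs, with the scalar `tau` above standing in for the learned per-node threshold vector.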
Keywords: Machine learning, Deep learning, Conditional deep learning
Article history: Received 16 June 2020, Revised 8 April 2021, Accepted 30 June 2021, Available online 10 July 2021, Version of Record 25 July 2021.
DOI: https://doi.org/10.1016/j.patcog.2021.108151