Distributed clustering of categorical data using the information bottleneck framework

Authors:

Highlights:

• We implement existing Information Bottleneck-based clustering algorithms and explore their limitations on a single machine with large inputs.

• We propose two new implementations that handle larger datasets in multi-machine setups without compromising the overall clustering quality.

• We provide theoretical foundations for the proposed implementations based on representative sampling and an efficient merging strategy.

• We evaluate the performance of these implementations on both real and synthetic datasets.

Keywords: Distributed clustering, Categorical data, Information Bottleneck

Review history: Received 24 December 2016, Revised 12 October 2017, Accepted 15 October 2017, Available online 18 October 2017, Version of Record 1 November 2017.

Paper URL: https://doi.org/10.1016/j.is.2017.10.006