A k-Median Algorithm with Running Time Independent of Data Size

作者:Adam Meyerson, Liadan O'Callaghan, Serge Plotkin

摘要

We give a sampling-based algorithm for the k-Median problem, with running time O(k \((\frac{{k^2 }}{ \in } \log k)^2 \) log\((\frac{k}{ \in } \log k)\)), where k is the desired number of clusters and ∈ is a confidence parameter. This is the first k-Median algorithm with fully polynomial running time that is independent of n, the size of the data set. It gives a solution that is, with high probability, an O(1)-approximation, if each cluster in some optimal solution has Ω\((\frac{{n \in }}{k})\) points. We also give weakly-polynomial-time algorithms for this problem and a relaxed version of k-Median in which a small fraction of outliers can be excluded. We give near-matching lower bounds showing that this assumption about cluster size is necessary. We also present a related algorithm for finding a clustering that excludes a small number of outliers.

论文关键词:clustering, sampling, sublinear

论文评审过程:

论文官网地址:https://doi.org/10.1023/B:MACH.0000033115.78247.f0