"No-substitution Online k-means Clustering with Adversarial Order"
Robi Bhattacharjee (UCSD)
Monday, January 11th 2021, 2-3pm
We investigate k-means clustering in the online no-substitution setting when the input arrives in arbitrary order. In this setting, points arrive one after another, and the algorithm is required to instantly decide whether to take the current point as a center before observing the next point. Decisions are irrevocable. The goal is to minimize both the number of centers and the k-means cost. Previous works in this setting assume that the input's order is random, or that the input's aspect ratio is bounded. It is known that if the order is arbitrary and there is no assumption on the input, then any algorithm must take all points as centers. Moreover, assuming a bounded aspect ratio is too restrictive --- it does not include natural input generated from mixture models.
We introduce a new complexity measure that quantifies the difficulty of clustering a dataset arriving in arbitrary order. We design a new random algorithm and prove that if applied on data with complexity d, the algorithm takes O(dk^2 log^2 n) centers, and is an O(k^3)-approximation algorithm. We also show that for "natural" distribution, such as a mixture of Gaussians, the complexity measure, d, is small ~ O(k^2log n). Thus, for a natural distribution, our algorithm takes O(poly(klog n)) centers and is an O(poly(k))-approximation. In terms of negative results, we show that the number of centers needed for an O(1)-approximation algorithm is at least O(d).