K-means clustering aims to partition a set of n points into k clusters in such a way that each observation belongs to the cluster with the nearest mean, and such that the sum of squared distances from each point to its nearest mean is minimal. In the worst case, this is a hard optimization problem: it is NP-hard in general, so no efficient algorithm is known for finding the optimal clustering, and naive exact approaches amount to an exhaustive search over all possible partitions of the data into k clusters. At the same time, fast heuristic algorithms for k-means are widely used in data science applications, even though they are only guaranteed to converge to local minimizers of the k-means objective. In this talk, we consider a semidefinite programming (SDP) relaxation of the k-means optimization problem. We discuss two regimes where the SDP provides an algorithm with improved clustering guarantees compared to previous results in the literature: (a) for points drawn from isotropic distributions supported in separated balls, the SDP recovers the globally optimal k-means clustering under mild separation conditions; (b) for points drawn from mixtures of distributions with bounded variance, the SDP solution can be rounded to a clustering that is guaranteed to classify all but a small fraction of the points correctly. An interesting feature of the theoretical tools developed for proving (approximate) optimality of partitions under models (a) and (b) is that they can also be used to certify, a posteriori, the (approximate) optimality of k-means clustering solutions on real data, with no model required.
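To make the objective concrete, here is a minimal Python sketch of the k-means cost together with a Lloyd-style alternating heuristic. Lloyd's algorithm is an assumption chosen for illustration: the abstract speaks only of "fast heuristic algorithms" and does not name one, and this is not the SDP approach discussed in the talk. The function names and the toy data are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def kmeans_objective(points, labels, k):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for j in range(k):
        cluster = [p for p, lab in zip(points, labels) if lab == j]
        if cluster:
            mu = mean(cluster)
            total += sum(sq_dist(p, mu) for p in cluster)
    return total


def lloyd(points, k, iters=100, seed=0):
    """Lloyd-style heuristic (illustrative choice, not named in the abstract).

    Alternates between assigning points to the nearest center and
    recomputing centers. It stops at a local minimizer, which need not
    be the global optimum.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize with k distinct data points
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            cluster = [p for p, lab in zip(points, labels) if lab == j]
            if cluster:  # keep the old center if a cluster becomes empty
                centers[j] = mean(cluster)
    return labels


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = lloyd(points, k=2)
cost = kmeans_objective(points, labels, k=2)
print(labels, cost)
```

On well-separated data like this, the heuristic finds the natural grouping, but nothing in its output says whether the resulting cost is globally optimal. That gap is what a certificate of (approximate) optimality is meant to close.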
I am a Moore-Sloan Research Fellow at the Center for Data Science at New York University. I also hold a Collaboration Scientist appointment at the Algorithms and Geometry Simons Collaboration. Before this, I was a Research Fellow in a semester-long optimization program at the Simons Institute, UC Berkeley. In May 2017 I finished my PhD in Mathematics at the University of Texas at Austin, where I was very fortunate to have Rachel Ward as my advisor. My research is in mathematical data science and optimization. I am also interested in number theory.