Density-Based Cluster Analysis
- Ingo Steinwart (University of Stuttgart)
Abstract
A central, initial task in data science is cluster analysis, where the goal is to find clusters in unlabeled data. One widely accepted definition of clusters has its roots in a paper by Carmichael et al., where clusters are described as densely populated areas of the input space that are separated by less populated areas. The mathematical translation of this idea usually assumes that the data are generated by some unknown probability measure that has a density with respect to the Lebesgue measure. Given a threshold level, the clusters are then defined to be the connected components of the corresponding density level set. However, choosing this threshold, together with any width parameters of the density estimator, is left to the user and is a notoriously difficult problem that is typically addressed only by heuristics.
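In symbols, and with notation that is assumed here for illustration rather than taken from the talk: writing $h$ for the unknown density and $\rho \ge 0$ for the threshold level, the level-set definition of clusters reads roughly as follows.

```latex
M_\rho := \{\, x : h(x) \ge \rho \,\},
\qquad
\text{clusters at level } \rho := \text{connected components of } M_\rho .
```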
In this talk, I show how a simple algorithm based on a density estimator can find the smallest level at which the level set has more than one connected component. Unlike other clustering algorithms, this approach is fully adaptive in the sense that it does not require the user to guess crucial hyper-parameters. Finally, I present some numerical illustrations.
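To make the idea of a "smallest level with more than one connected component" concrete, here is a minimal Python sketch. It is not the algorithm presented in the talk: it uses a plain ball-count density estimate, treats two samples as connected when they lie within a distance `tau`, and scans a fixed grid of levels. The names `radius`, `tau`, `n_levels`, and both function names are hypothetical, and the two width parameters are fixed by hand here, which is exactly the kind of user choice the talk's adaptive approach is meant to avoid.

```python
import numpy as np


def count_components(adj):
    """Count connected components of a graph given as a boolean adjacency matrix."""
    n = adj.shape[0]
    seen = np.zeros(n, dtype=bool)
    count = 0
    for start in range(n):
        if seen[start]:
            continue
        count += 1
        frontier = np.zeros(n, dtype=bool)
        frontier[start] = True
        # Breadth-first search: grow the frontier until no new nodes are reached.
        while frontier.any():
            seen |= frontier
            frontier = adj[frontier].any(axis=0) & ~seen
    return count


def smallest_splitting_level(X, radius, tau, n_levels=50):
    """Return the first grid level whose estimated level set splits, or None.

    X has shape (n_samples, n_features). `radius` and `tau` are assumed,
    hand-picked width parameters (not chosen adaptively in this sketch).
    """
    # Pairwise squared distances between all samples.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Ball-count density estimate, up to a constant normalisation factor.
    density = (d2 <= radius ** 2).mean(axis=1)
    # Two samples are linked when they are at most tau apart.
    adj = d2 <= tau ** 2
    for level in np.linspace(0.0, density.max(), n_levels):
        keep = density >= level
        sub = adj[keep][:, keep]
        if count_components(sub) > 1:
            return float(level)
    return None


# Toy data: two dense groups on a line joined by a sparse bridge.
left = np.linspace(0.0, 1.0, 50)
bridge = np.linspace(1.2, 2.8, 5)
right = np.linspace(3.0, 4.0, 50)
X = np.concatenate([left, bridge, right])[:, None]

level = smallest_splitting_level(X, radius=0.3, tau=0.5)
print("first level at which the estimated level set splits:", level)
```

On the toy data, all samples are linked at low levels through the sparse bridge, and the set splits once the level exceeds the estimated density of the bridge points. A practical method would also need to guard against components created purely by noise, which this sketch does not attempt.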