Abstract:
|
Historically, the field of clustering has focused on partitioning datasets with little regard for the underlying data-generating process. This has left serious unresolved questions about how to interpret clustering results and how algorithms are affected by randomness. We consider the nature of a clustering problem from a statistical perspective, focusing on population-level models. From this population-based perspective, we discuss the differences between classifiers, clusterings, and linkage assignments, and we propose new indices that place cluster validation within the well-known framework of sensitivity and specificity. Dozens of indices have been proposed for comparing clusterings, but the arguments for selecting one index over another have not been well understood. The framework we propose provides a clear interpretation of the indices and, when tested in a supervised setting, enables researchers to assess the difficulty of their clustering problem. In turn, this enables far stronger interpretations of clustering results.
|