Abstract:
|
Clustering is often used as a first look at a dataset to determine whether it contains hard-to-detect patterns. Both the number of clusters and the dimensions analyzed affect the usefulness of the resulting clustering, yet both generally cannot be known without extensive prior analysis of the dataset. We use a bootstrap approach to simultaneously identify the number of clusters and the dimensions that should be included in the most practical clustering scheme for the data. The algorithm iteratively compares simpler and more complex models, testing whether the information gained from a more complex model is worth its added complexity. We apply the algorithm to a dataset of 100 km race runners and a dataset of written digits, and we perform a simulation study to assess its performance in differing situations and in comparison with other similar techniques.
|