Abstract:

Directional data are data in which the direction of a vector carries more relevant information than its magnitude. Text documents can be expressed as directional data by normalizing the word-frequency vector of each document to unit length, which places the data on a unit hypersphere. Mixtures of von Mises-Fisher distributions have proven to be an effective model for clustering data on a unit hypersphere, but variable selection for these models remains an important and challenging problem. We derive two variants of the expectation-maximization framework, each of which is used to identify a specific type of variable that is irrelevant for clustering. The first type is noise variables, which are not useful for separating any pair of clusters. The second type is redundant variables, which may be useful for separating pairs of clusters but do not enable any separation beyond that already provided by other variables. Removing these irrelevant variables is shown to improve cluster quality on simulated as well as benchmark datasets.
