JSM 2016 Online Program

Activity Number:	540
Type:	Contributed
Date/Time:	Wednesday, August 3, 2016 : 10:30 AM to 12:20 PM
Sponsor:	Section on Statistical Computing
Abstract #320515
Title:	Dimensionality Reduction for Clustering Data on a Unit Hypersphere with Application to Text Mining
Author(s):	Semhar Michael* and Volodymyr Melnykov
Companies:	South Dakota State University and University of Alabama
Keywords:	Model-based clustering ; Finite mixture modeling ; Text document clustering ; Variable selection ; directional data ; EM algorithm
Abstract:	Model-based clustering via finite mixtures of von Mises-Fisher distributions is commonly used to group observations lying on a unit hypersphere. One popular application is finding groups in a given pool of text documents. Text documents can be expressed as directional data by normalizing the word frequencies in each document. In multivariate data analysis, it is well known that the presence of noise variables in data degrades clustering performance. Text documents usually contain many words that provide no useful information for clustering. We develop a procedure that identifies and removes noise and redundant words while clustering the documents. The good performance of the developed procedure is illustrated on synthetic and real-life data.
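
As an illustration of the normalization step mentioned in the abstract, the following minimal sketch scales each document's word-count vector to unit Euclidean length so that it lies on a unit hypersphere. The function name `to_unit_hypersphere` and the toy count matrix are hypothetical, and the use of the L2 norm on raw counts is an assumption; the abstract does not state the authors' exact preprocessing.

```python
import numpy as np


def to_unit_hypersphere(counts):
    """Scale each row (one document's word counts) to unit L2 norm.

    Hypothetical helper for illustration only; not the authors' code.
    """
    counts = np.asarray(counts, dtype=float)
    # Euclidean length of each document vector, kept as a column for broadcasting.
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    if np.any(norms == 0):
        # A document with no counted words has no direction on the sphere.
        raise ValueError("every document needs at least one nonzero count")
    return counts / norms


# Toy document-term matrix (assumed data): 2 documents, 3 vocabulary words.
counts = np.array([[3, 0, 4],
                   [1, 2, 2]])
unit_docs = to_unit_hypersphere(counts)
```

Each row of `unit_docs` then has length one, which is the form of data that a von Mises-Fisher mixture is defined on.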

Authors who are presenting talks have a * after their name.