Keywords: Active labeling, Finite mixture models, Classification, Generalized hyperbolic distribution
To classify textual customer response data into relevant categories, we create an algorithm that combines active learning and model-based classification using a mixture of generalized hyperbolic distributions (GHDs). Active learning refers to machine learning algorithms that can request labeled examples from the user, which is particularly efficient when the dataset is large. Model-based classification assumes that the population is a convex combination of sub-populations, each with its own density function. The GHD has the advantage of being flexible enough to classify the data even when the groups are skewed or contain outliers. It is therefore better suited to representing real-life phenomena than other popular distributions, such as the bell-shaped, symmetric Gaussian or Student's t. The algorithm begins with feature extraction, in which customer responses are converted to a numeric data matrix using word embeddings implemented with GloVe and spaCy. If no labels are given, the algorithm first conducts a cluster analysis using a mixture of GHDs to determine a plausible organization of the data, then consults experts by showing them observations near each cluster mode. This is the key step in determining the label for each group. Once the main structure is determined, the algorithm refines the boundaries between groups by showing experts observations in the overlap areas between groups. Using the new labels, it performs semi-supervised classification with the mixture of GHDs. For overlap detection, we rely mainly on each observation's group-membership probabilities given by the current classifier, or on silhouette widths. The algorithm can also start from given labels when partial labels are available. The final output is a classification of every document. We use both real and simulated data to evaluate the algorithm's correctness and its efficiency relative to existing classifiers, such as support vector machines, k-nearest neighbors, and random forests.
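To illustrate the query-selection logic described above, the following is a minimal sketch in Python. It assumes the classifier has already produced an n-by-G matrix of group-membership probabilities; the function name, the number of mode representatives, and the margin threshold are all hypothetical choices for illustration, not the paper's actual implementation.

```python
import numpy as np

def select_queries(posterior, n_mode=2, margin_threshold=0.2):
    """Pick observations to show the expert.

    posterior: (n, G) array giving each observation's probability of
    belonging to each of the G groups, from the current classifier.
    Returns (a) mode representatives: the observations with the
    highest membership probability in each group, used to name the
    groups, and (b) overlap candidates: observations whose top-two
    probabilities differ by less than `margin_threshold`, used to
    refine the boundaries between groups.
    """
    # Margin between the largest and second-largest probabilities;
    # a small margin means the observation sits between two groups.
    sorted_p = np.sort(posterior, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]

    labels = posterior.argmax(axis=1)
    mode_idx = []
    for g in range(posterior.shape[1]):
        members = np.where(labels == g)[0]
        if members.size:
            # Take the n_mode observations most confidently in group g.
            best = members[np.argsort(posterior[members, g])[::-1][:n_mode]]
            mode_idx.extend(best.tolist())

    overlap_idx = np.where(margin < margin_threshold)[0].tolist()
    return mode_idx, overlap_idx
```

In practice the posterior matrix would come from the fitted mixture of GHDs; the same margin rule could equally be driven by silhouette widths, as the abstract notes.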