Abstract:
|
In this paper we extend the idea of spectral clustering using probabilistic PCA (PPCA) to cluster panel data. The key challenge is to determine the true number of clusters. A number of existing solutions assume a factor analysis model in which both the observations and a factor matrix are observed and the loading matrix (W) is estimated. However, when only the observations are available, a latent "phantom" random vector can be used to account for the clustering structure. Under the broad assumption that the number of clusters is small relative to the sample size, a penalized form of PPCA is implemented to directly estimate the number of clusters (p). We show theoretically that the penalized MLE p0 is consistent for reasonable choices of the penalty parameter. This approach resembles shrinkage estimation, since the last N-p0 singular values of the estimated W are shrunk to zero. We demonstrate with data from Google Domestic Trend searches that search terms assigned to the same cluster are conceptually consistent, and a visual inspection of the overlaid raw data confirms the shared similarity of trends over time.
|