Abstract:
|
Most extensions of standard PCA to exponential family data assume that the natural parameter matrix can be factorized into two low-rank matrices: the principal component loadings matrix and the scores matrix. The quality of the component scores is of great importance for downstream tasks such as clustering and regression. When both loadings and scores are treated as fixed and unknown, they are often estimated jointly by maximum likelihood. However, joint estimation tends to inflate the magnitude of the component scores and degrade their quality when the data dimension is fixed. One possible source of this inflation is the bias of the MLE in generalized linear models. We examine the extent of bias in component scores for logistic PCA with binary data. Through simulation studies, we evaluate the effectiveness of some existing bias-reduction methods for the MLE in logistic regression when the loadings are treated as known or estimated first from training data. In addition, we compare the quality of component scores from joint estimation with that of scores from an alternative formulation of logistic PCA based on the projection of saturated logit parameters.
|