Abstract:
|
We propose a new data-driven method for simultaneous variable selection and clustering in high-dimensional multinomial regression. Unlike other grouping pursuit methods, such as regression with a graph Laplacian penalty, our method does not assume that moderately to highly correlated variables have similar regression coefficients or belong to the same cluster. Relaxing this assumption is practically meaningful when the response variable is multinomial: for example, moderately to highly correlated expressed genes may be associated with different subtypes of a disease. We propose a penalty function that takes both regression coefficients and pairwise correlations into account when defining variable clusters. We also develop an algorithm for this penalty function that combines convex optimization and clustering. In a simulation study, we compare our method with other methods and show that it can yield correct variable clustering and improve prediction performance. A real data example is also presented.
|