Abstract:
|
Uncovering topics in short text corpora has become increasingly important with the growth of online communication. However, conventional topic mining methods may fail due to the lack of context in each short text. Fortunately, a large proportion of online short texts co-occur with lengthy texts, such as comments attached to news articles. These two kinds of texts are hierarchically organized, and the hidden topical relationships between them can be exploited to enhance topic learning on both sides. We therefore propose a topic model for (h)ierarchical (d)ocuments, referred to as hdLDA, to capture the hierarchical structure of these texts. Specifically, in hdLDA each short text has a probability distribution over two kinds of topics: those underlying the lengthy texts and those formed only by the short texts. We also introduce an online learning algorithm for hdLDA to enable efficient topic inference. Extensive experiments on real-world datasets demonstrate that our approach discovers more comprehensive topics for both short texts and lengthy documents, compared with baseline and state-of-the-art methods.
|