Modeling Hidden Topics on Document Manifold
Topic modeling has been a key problem for document analysis.
One of the canonical approaches to topic modeling is Probabilistic
Latent Semantic Indexing (PLSI), which maximizes the joint probability
of documents and terms in the corpus. The major disadvantage of
PLSI is that it estimates the probability distribution of each document
on the hidden topics independently, and the number of parameters
in the model grows linearly with the size of the corpus, which
leads to serious overfitting. Latent Dirichlet Allocation
(LDA) was proposed to overcome this problem by treating the
probability distribution of each document over topics as a hidden
random variable. Both methods discover the hidden
topics in the Euclidean space. However, there is no convincing evidence
that the document space is Euclidean, or flat. Therefore, it is
more natural and reasonable to assume that the document space is
a manifold, either linear or nonlinear. In this paper, we consider the
problem of topic modeling on the intrinsic document manifold. Specifically,
we propose a novel algorithm called Laplacian Probabilistic
Latent Semantic Indexing (LapPLSI) for topic modeling. LapPLSI
models the document space as a submanifold embedded in the ambient
space and performs topic modeling directly on this document
manifold. We compare the proposed LapPLSI
approach with PLSI and LDA on three text data sets. Experimental
results show that LapPLSI provides a better representation in the
sense of semantic structure.
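
The idea summarized above can be illustrated with a small sketch. It is not taken from the paper: it assumes the manifold is approximated by a nearest-neighbor graph over documents with edge weights W, and that the regularizer is the usual graph-Laplacian smoothness term on the per-document topic distributions P(z|d). The function names, the weight matrix, and the trade-off parameter lam are all hypothetical, and the exact objective used by LapPLSI may differ.

```python
import math


def plsi_log_likelihood(counts, p_w_given_z, p_z_given_d):
    """PLSI log-likelihood, up to the constant P(d) term.

    counts[i][j]      = number of times word j occurs in document i
    p_w_given_z[k][j] = P(word j | topic k)
    p_z_given_d[i][k] = P(topic k | document i)
    Assumes every word with a nonzero count has nonzero model probability.
    """
    num_topics = len(p_w_given_z)
    total = 0.0
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            if n == 0:
                continue  # zero counts contribute nothing
            p_word = sum(p_z_given_d[i][k] * p_w_given_z[k][j]
                         for k in range(num_topics))
            total += n * math.log(p_word)
    return total


def manifold_penalty(p_z_given_d, weights):
    """Assumed smoothness term: 0.5 * sum_k sum_ij W_ij (P(z_k|d_i) - P(z_k|d_j))^2.

    weights[i][j] is a hypothetical document-graph weight, e.g. 1 if
    documents i and j are nearest neighbors and 0 otherwise.
    """
    num_docs = len(p_z_given_d)
    num_topics = len(p_z_given_d[0])
    total = 0.0
    for i in range(num_docs):
        for j in range(num_docs):
            if weights[i][j] == 0:
                continue
            for k in range(num_topics):
                diff = p_z_given_d[i][k] - p_z_given_d[j][k]
                total += weights[i][j] * diff * diff
    return 0.5 * total


def regularized_objective(counts, p_w_given_z, p_z_given_d, weights, lam):
    """Log-likelihood minus lam times the smoothness penalty (to be maximized)."""
    return (plsi_log_likelihood(counts, p_w_given_z, p_z_given_d)
            - lam * manifold_penalty(p_z_given_d, weights))
```

Under these assumptions, neighboring documents are pushed toward similar topic distributions, and setting lam to 0 recovers the plain PLSI objective. With K topics, N documents, and M vocabulary terms, the tables above hold K*M + N*K entries, which is the linear growth in corpus size that the abstract identifies as the source of overfitting.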
Date: October 15, 2008
Book Title: 2008 ACM Conf. on Information and Knowledge Management (CIKM'08), Napa Valley, CA
Type: Article
Address: Napa Valley, CA, USA
BibTeX:
@Article{Modeling_Hidden_Topics_on_Document_Manif,
author = "Deng Cai and Qiaozhu Mei and ChengXiang Zhai",
title = "{Modeling Hidden Topics on Document Manifold}",
month = "October",
year = "2008",
address = "Napa Valley, CA, USA",
journal = "2008 ACM Conf. on Information and Knowledge Management (CIKM'08), Napa Valley, CA",
}