2008年3月24日 星期一

[Paper Review] Probabilistic Latent Semantic Indexing

The author proposed a probabilistic formulation of latent semantic model on count data. Assume now we have a corpus of text documents, the formulation consists of a set of factors, each is a combination of word-to-factor probabilities, document-to-factor probabilities and mixing proportions of that factor. Given the number of desired factors, the whole model is optimized using tempered expectation minimization(TEM) with some held-out data to avoid over-fitting. The resulting mixture model can then be used in various information retrieval problems.


The most important value I see here is the ability to extend the model with smaller models by linear combination of individual probability estimates. This directly avoids the need to proceed an entire optimization for the new model as in the case of LSA, which is certainly time-consuming. Another benefit is that one may not be required to know the optimal number of factors prior to applying the algorithm. Furthermore, the extended model would gain additional robustness to noise due to the model averaging. In this aspect, I think pLSA is a solid improvement of LSA.


To validate the usefulness of pLSA, the author also conducted several experiments on document retrieval. Although the performance is indeed superior to the original LSA and term-matching methods, I have to say I did not see great improvements here. In most cases, the pLSA variants go ahead of LSA only by a small margin (less than 8 percent) and the the use of idf seems to be equally important. I doubt if there still exists some undiscovered significant properties of pLSA in this paper.


Another shortcoming of pLSA that I found on the Wikipedia is the problem of over-fitting. This is somehow reasonable as pLSA is non-deterministic(to the choice of held-out data) and the solution space it explores is highly non-linear. Whether there exists good solution to this problem then requires further reading.

沒有留言: