Monday, June 2, 2008

The PageRank Citation Ranking: Bringing Order to the Web

This paper introduces the PageRank algorithm, one of the central ranking estimators employed in the Google search engine. The idea of PageRank is quite simple: if a page is cited (linked) by an important page, it is likely to be important as well. By putting more important pages ahead of less important ones in the result list, we can effectively improve search quality. Although I don't actually know how the different estimators are combined in Google, PageRank seems to be the essential one.
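A rough way to write the recursive definition (in the common damping-factor form; the symbols below are my notation, not necessarily the paper's):

    R(u) = \frac{1 - d}{N} + d \sum_{v \in B_u} \frac{R(v)}{L(v)}

where B_u is the set of pages linking to u, L(v) is the number of out-links of v, N is the total number of pages, and d is a damping factor (commonly around 0.85).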


Although the definition of PageRank is recursive, given a network it can be computed by naive iteration once we view the pages and links as nodes and edges respectively. First, we randomly initialize the value of each page; then we repeatedly update the values by propagating them through the links on each page. The iteration is run until convergence. If we interpret the normalized page values as probabilities, what PageRank computes is equivalent to the long-term (stationary) distribution of a random-walk Markov chain. As argued in the paper, the simple iteration converges quickly on common networks because their structure permits it.
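To make the procedure concrete, here is a minimal Python sketch of the iteration, assuming an adjacency-list input and the damping-factor form above; the function name, the dangling-page handling, and the parameter defaults are my own choices, not the paper's:

    import numpy as np

    def pagerank(links, d=0.85, tol=1e-9, max_iter=100):
        # links[i] is the list of pages that page i links to.
        n = len(links)
        r = np.full(n, 1.0 / n)  # uniform start; a random start also converges
        for _ in range(max_iter):
            new_r = np.full(n, (1.0 - d) / n)  # mass from the random jump
            for i, outs in enumerate(links):
                if outs:
                    for j in outs:  # propagate page i's value along its links
                        new_r[j] += d * r[i] / len(outs)
                else:
                    new_r += d * r[i] / n  # dangling page: spread uniformly
            delta = np.abs(new_r - r).sum()
            r = new_r
            if delta < tol:  # stop once the values stabilize
                break
        return r

    # A tiny four-page web: page 0 links to pages 1 and 2, and so on.
    print(pagerank([[1, 2], [2], [0], [0, 2]]))

The values returned by this sketch sum to 1, so they can be read directly as the stationary probabilities of the random walk.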


The authors further introduce a method to personalize the search engine. Using the idea of transition probability matrices from the random-walk view, we could blend each row with a constant vector that characterizes personal preferences. The experiments on real data support this idea. The other applications shown in the paper are also interesting.
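As a sketch of how the personalization could look, the uniform jump term in the iteration above can be replaced by a preference vector; the vector p below (roughly playing the role of the paper's personalization distribution) and the function name are illustrative assumptions:

    import numpy as np

    def personalized_pagerank(links, p, d=0.85, tol=1e-9, max_iter=100):
        # p[i] encodes how much the user prefers page i (need not sum to 1).
        n = len(links)
        p = np.asarray(p, dtype=float)
        p = p / p.sum()  # normalize preferences into a jump distribution
        r = p.copy()
        for _ in range(max_iter):
            new_r = (1.0 - d) * p  # random jumps now land on preferred pages
            for i, outs in enumerate(links):
                if outs:
                    for j in outs:
                        new_r[j] += d * r[i] / len(outs)
                else:
                    new_r += d * r[i] * p  # dangling mass also follows p
            delta = np.abs(new_r - r).sum()
            r = new_r
            if delta < tol:
                break
        return r

    # Bias the ranking toward page 3 on the same four-page web.
    print(personalized_pagerank([[1, 2], [2], [0], [0, 2]], p=[1, 1, 1, 10]))

Setting p uniform recovers the plain PageRank sketch, which is why this reads as blending the transition matrix rows with a preference vector.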
