Count links to get a No 1 hit

Google's PageRank and Beyond

六月 9, 2006

Google, as virtually any web user knows, is the leading search engine company in the world. How did it reach this position and how did it get there so rapidly? Timing and location were important. The founders, Larry Page and Sergey Brin, were PhD students at Stanford University in the 1990s. The web was expanding quickly, and they were at the heart of Silicon Valley before the dot-com shake-out and when venture capital was still relatively easy to raise.

Existing search engines such as AltaVista, the leading search site at the time, were returning increasing numbers of results for users, but the ordering was not ideal. In practice, most users look at only the first few results in any list given by a search engine, so the ordering is crucial. At the time, this was determined entirely by the contents of individual web pages, so page creators were trying to increase the likelihood of being displayed, for example, by adding multiple keywords to pages. Even though search engines were trying to counter such measures, the search results were increasingly poor and inappropriate as the web expanded.

Page and Brin's insight was to observe that a very important part of the informational structure of the web was the hyperlinks between pages rather than just the contents of individual pages. If a page has many links to it, it is more likely to be an important page and of greater interest to many users than one with few or no links. Of course the rating of the pages linking to a given page should also be taken into account in determining how important the link is in calculating a given page's ranking. With such information, it was possible to order pages in a much more useful way for users, in an almost automatic and self-organising manner.

One beauty of the strategy is that most links have been placed there by a human with knowledge of the web and its resources. Thus the presence of a link indicates in general a level of interest for a page as determined by a person rather than a computer.

Of course, it is necessary to come up with an automated algorithm that is tractable and scalable for the rapidly growing web. According to this book, there are more than 8 billion pages indexed by Google, more than one for every person on the planet. With this in mind, Page and Brin devised the PageRank algorithm, which is the main subject of this book and is still used by Google today. The aim of the algorithm is to assign a ranking for every page indexed by Google, based on the number of the links to the page and the rankings of those linked pages. This is naturally an iterative process and once a month Google updates its assessment of page rankings with a massive computational effort over several days.

The mathematics involved uses matrices and, given the size of the task (8 billion pages and rising), optimisation techniques must be used in the calculations. Fortunately, the matrices involved are relatively sparsely populated because individual web pages link to a small number of other pages only, so the problem remains tractable with enough computational power, which Google can afford.

Other related algorithms and variants of the PageRank algorithm are also covered in the book. The HITS algorithm considers links from a web page as well as links to it. Pages with a high number of links form hubs that are useful for web surfers in finding other resources. This algorithm was used by the Teoma search engine. Teoma recently merged with Ask.com, and it now uses the ExpertRank approach, which identifies subject-specific topic clusters to help improve the ordering. This is not included in the book because it is difficult for a printed text to represent more than a snapshot of such a fast-moving area. Another new approach is TrustRank, which attempts to counter the problem of spam pages being added to the web to improve page rankings artificially through links added for commercial rather than human interest. Again, this more recently developed semi-automated technique (also from Stanford, with Yahoo) is not covered here.

A feature of the book is a number of "asides", short sections of a page or less on related topics. These work well to break up the presentation, especially the mathematical parts, and each can be read on its own if the book is being scanned rather than read cover to cover. There are even shorter boxes that also can mostly be read on their own. These features are certainly thoughtful additions for those without the time or the inclination to read all the material.

The book is a mixture of relatively light but interesting reading on the background to Google's PageRank algorithm and other competing approaches, together with some quite serious technical presentation of the mathematics involved. Fortunately, it is easy for the reader to identify which is which. The lighter material is well presented and will appeal to anyone interested in the web. The mathematical material is for the serious computer scientist who wishes to explore the formal foundations of the algorithms covered. There are also some programming examples included, with source code presented (fortunately of a reasonable length, never more than a page or so).

The problem for this book is the difficulty of satisfying both types of reader. I suspect that the publisher was probably keen to include the more general material to increase the sales of the book. Certainly it presents worthwhile information, but one gets in effect only half a book if the mathematical material is removed. Thus general readers might think twice about purchasing it. For myself, both sides of the story are interesting, but there may be a limited number of others who share my view. Still, I think this is a worthwhile book. It offers a comprehensive and erudite presentation of PageRank and related search-engine algorithms, and it is written in an approachable way, given the mathematical foundations involved.

Jonathan Bowen is professor of computing, London South Bank University, and chairman of Museophile Ltd.

Google's PageRank and Beyond: The Science of Search Engine Rankings

Author - Amy N. Langville and Carl D. Meyer
Publisher - Princeton University Press
Pages - 224
Price - £22.95
ISBN - 0 691 12202 4

请先注册再继续

为何要注册?

  • 注册是免费的,而且十分便捷
  • 注册成功后,您每月可免费阅读3篇文章
  • 订阅我们的邮件
注册
Please 登录 or 注册 to read this article.