How much do you trust Google, Yahoo!, and MSN search?

It's an issue that arises when considering the future of the internet.

Imagine the internet without search engines. Just you and the billions of pages out there at the end of your connection. How do you utilise those pages? The short answer is: you can't. Search is so ingrained in the way the internet is used and, consequently, in the way it has developed in recent years, that without search capabilities much of the functionality of the internet would be lost. Now consider another possibility: you have the ability to search the internet, but there is only one search provider available - a monopoly has been created. How much power would that search engine have, and how would that power be used?

Currently we have the big three in search - Google, Yahoo!, and MSN - and their competition engenders confidence that when we search we can find results which are not unfairly weighted by the search engines themselves. There are rankings and weight is given to certain results (we are talking about search engines after all!) but these are applied to improve the relevancy of your search query result and, crucially, we can be fairly certain that the results given are the product of a search engine algorithm applied to each and every one of those results.

We can therefore be sure that, to paraphrase Orwell, we are not in the situation where all search results are equal, but some results are more equal than others. Competition helps us because if unfair results were routinely returned by one of the big three then it would be quickly picked up by users and, almost certainly, result in reduced usage of the offending search engine and a nightmare scenario for their PR department.

As long as competition remains, then, we can be happy with the search systems available to us; systems that allow us to use the internet as a tool and viable marketplace. But how long will competition remain - how long before the monopoly arrives? Maybe never, but it was with the search engine monopoly scenario in mind that Alex Chudnovsky set up Majestic 12 - a search engine which uses distributed computing, as used by SETI@home, to provide what he feels could be a viable alternative to today's search engine giants.

The early success of the project is apparent: an alpha version of the Majestic 12 search engine was released on the 3rd of June 2005, and this month saw the number of indexed pages reach 1 billion while the number of spidered pages sits at around 8 billion.

Most recently of Jungle.com, Mr Chudnovsky explicitly states his concern for the single search engine problem, believing that Google is already in that position: "Because of their [Google's] success, they have effectively created a monopoly in the virtual world. Monopolies never end up well for consumers."

Mr Chudnovsky originally specialised in the "research and development of innovative software designed to assist IT and business professionals in maximising revenues from their web sites", and has driven the development of Majestic 12 from concept to implementation. Asked why he took on the project, Mr Chudnovsky has said:

"Because we can - personal computers and connections are reaching levels where massive projects of this scale are possible. Come to think of it: just 1 computer on 512k broadband can crawl 0.5mln pages a day, so having just 8,000 participants in the network would result in daily crawl of as much data as Google itself has in its database! And that can be achieved in just one day! We therefore can definitely beat Google, at least in terms of up-to-dateness and depth (size) of web database."
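The arithmetic behind that claim is easy to check. The short sketch below uses only the figures in the quote, plus one assumption that is not in the original: that "512k broadband" means a 512 kilobit-per-second connection running flat out all day. The function names are purely illustrative.

```python
# Figures taken from Mr Chudnovsky's quote
PAGES_PER_NODE_PER_DAY = 500_000   # "0.5mln pages a day"
PARTICIPANTS = 8_000               # "8,000 participants"

# Assumption (not in the quote): "512k" = 512 kilobits per second,
# and the link is fully used for crawling 24 hours a day.
LINK_BITS_PER_SECOND = 512_000
SECONDS_PER_DAY = 24 * 60 * 60


def network_pages_per_day(nodes=PARTICIPANTS, per_node=PAGES_PER_NODE_PER_DAY):
    """Total pages crawled per day by the whole volunteer network."""
    return nodes * per_node


def bytes_per_page_budget(bits_per_second=LINK_BITS_PER_SECOND,
                          per_node=PAGES_PER_NODE_PER_DAY):
    """Average bytes one node can spend on each page at the quoted rate."""
    bytes_per_day = bits_per_second // 8 * SECONDS_PER_DAY
    return bytes_per_day / per_node


print(network_pages_per_day())          # 4000000000
print(round(bytes_per_page_budget()))   # 11059
```

On those numbers the network would crawl 4 billion pages a day. Under the bandwidth assumption, each participant has roughly 11 KB to spend per page, so the quoted rate only holds if the average download per page stays at or below that.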

The idea of a distributed network providing search engine functionality is not unique, of course. Grub was a well-known attempt to provide just such functionality, though it is now defunct, having provided an object lesson in the difficulties associated with providing relevant search engine results.

The Grub system worked as a volunteer project, with participants allowing the Grub screensaver to run while their computer ran idle - enabling the search for, and indexing of, new pages for the Grub index. Although no longer operating, the owners of Grub, LookSmart Ltd, consider the project a success as the "...extended trial period proved the concept and the underlying robustness of the [distributed computing] platform." Where the system fell down was in the difficulty in providing relevant results. From the Wikipedia page on Grub:

"Many state that a large cache is not the strength of a good search engine, rather, that it is the ability to deliver accurate, precise results to users."

Presumably Mr Chudnovsky has addressed this problem in the algorithms used for Majestic 12. If not, then perhaps the board of directors behind Nutch have.

With Tim O'Reilly and Peter Savich on that board, hopes are high that Nutch, while still very early in development, will prove to be a solid entrant into the distributed search market. A Californian non-profit corporation run for the benefit of the public, Nutch has as its eventual aim "providing free high-quality search software and its source code to the public and facilitating ongoing research and development of search technology in a public forum". The board of directors also state that "Nutch is primarily a software project, not a service," so we shouldn't get too excited about a rival for Google just yet.

So the distributed search engine concept is well known, but what of the vulnerabilities of such a system? As Google, Yahoo!, and MSN well know, there are a great many unscrupulous organic search engine optimisers and spammers who eschew ethical optimisation practices and try to manipulate search engine result pages. These practices could become a serious hindrance to unbiased results in a system whose algorithm is visible to all, as the algorithms of distributed computing systems have to be. Imagine a search engine where, because the source code of the search algorithm has been 'cracked', the index has been filled with content entirely different from that of the actual pages.

Questions are also raised as to the ability of these distributed search engines to return results from a database as large as Google's - should they ever succeed in reaching such a size - in a reasonable time. The very nature of distributed computing necessitates many more steps to serve a result to the user than an equivalent search on a dedicated server system such as Google's (which is so large that speculation about the precise number of servers ranges widely, with cited figures running from 10,000-15,000 to more than 70,000). Given that Majestic 12 has spidered around 8 billion pages yet indexed only 1 billion, questions about how the speed of the service will hold up as the index grows are becoming increasingly serious and threaten to undermine the whole concept of distributed computing.
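To give a feel for where those extra steps come from, here is a minimal sketch of one common way to query an index that is split across many machines: send the query to every node, then merge the partial answers. It is an illustration only, not a description of how Majestic 12, Grub, or Nutch actually work; the shard layout, scores, and function names are all invented for the example.

```python
import heapq


def search_shard(shard, term, k):
    """One node's work: return its k best (score, url) hits for the term."""
    return heapq.nlargest(k, shard.get(term, []))


def distributed_search(shards, term, k=3):
    """Scatter the query to every shard, then gather and merge the results."""
    # Step 1: scatter. In a real network each call would be a round trip
    # to another machine, so the slowest node sets the response time.
    partials = [search_shard(shard, term, k) for shard in shards]
    # Step 2: gather. Merge the partial lists into a single top-k ranking.
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [url for _score, url in merged]


# Hypothetical three-node index: each shard maps a term to (score, url) hits.
shards = [
    {"search": [(0.9, "a.example/1"), (0.4, "a.example/2")]},
    {"search": [(0.7, "b.example/1")]},
    {"search": [(0.8, "c.example/1"), (0.1, "c.example/2")]},
]

print(distributed_search(shards, "search"))
# ['a.example/1', 'c.example/1', 'b.example/1']
```

A single dedicated index needs only the ranking step. The scatter and merge steps are the overhead, and with volunteer machines on home connections the scatter step is the one most likely to be slow.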

However, with an agenda to "count every link and every page on the Internet", it's hard not to admire the scope of Majestic 12. And given the importance of competition to the validity of search results and, therefore, internet usability, it's equally hard not to applaud the attempts of Majestic 12 and Nutch to shake up the status quo.

For more information, feel free to follow these links:

Majestic 12:
http://www.majestic12.co.uk/about.php
Nutch:
http://nutch.sourceforge.net/docs/en/about.html
Grub:
http://grub.looksmart.com/
bigmouthmedia - 3 parts optimization, 1 part listings magic
© bigmouthmedia 2008