Search Engines Overview

There are basically three types of search engines.

Search Scripts are generally written in Perl or C and search sites of at most a few thousand pages. They do not include complex algorithms to optimize searching and indexing, and they become unusable once a site grows beyond a few thousand pages. Such scripts are still a good fit for a home-made site, where a search box is more of a gadget.

Technically, those scripts re-index modified pages periodically (for example, once a day), which takes a few seconds or minutes, and they usually produce fixed search result output with few layout options. Because those scripts run entirely as CGI programs, they are slower: they never keep persistent data in memory, but store it on disk and load or parse it each time the search engine is invoked.
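The load-on-every-request pattern described above can be sketched in a few lines of Python. This is only an illustration, not the code of any script named in this overview: the JSON index file and the function names build_index and search are assumptions made for the example.

```python
import json
import os
import tempfile


def build_index(pages, index_path):
    """Batch step, run once in a while: map each word to the pages containing it."""
    index = {}
    for name, text in pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(name)
    for names in index.values():
        names.sort()
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f)


def search(query, index_path):
    """Per-request step: nothing stays in memory, so the whole index
    is re-read and re-parsed from disk on every single call."""
    with open(index_path, encoding="utf-8") as f:
        index = json.load(f)
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))  # keep pages containing all words
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical two-page site used only for the demonstration.
    pages = {
        "index.html": "Alkaline search engine overview",
        "faq.html": "search scripts and CGI",
    }
    path = os.path.join(tempfile.mkdtemp(), "index.json")
    build_index(pages, path)
    print(search("search engine", path))
```

The cost of the load in search grows with the size of the index file, which is one way to see why this approach stops being practical beyond a few thousand pages.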

Among such scripts you can find WebGlimpse (http://www.webglimpse.net/) or ht://Dig (http://www.htdig.org/).

Search Servers, built as ISAPI or WAI applications and sometimes mixed with CGI scripts, overcome the drawback of indexes being constantly re-read from the hard disk. Because it works as a permanently running server and answers multiple search requests simultaneously, this category of search engine requires more hardware power and more memory, and is aimed at larger sites that really need a good search engine. Alkaline is designed as a persistent search server.

Technically, search servers maintain indexes in RAM or use some internal swap mechanism. They have complex algorithms for searching and indexing, usually jealously kept secret by their designers. Alkaline uses the concept of "cellular expansion", which gives quite interesting performance and opens doors for future research. Cells are fast and cope well with growing data. Of course, it is no mystery that a big server with a lot of hardware power will search faster and be able to index a larger site. Existing Alkaline-powered sites maintain an index of 500,000 pages with about 450,000 word forms and run on industry-average Pentium III or Sun Ultra servers. Such a configuration can handle two to three search requests per second.
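To make the idea of an index held in RAM concrete, here is a minimal Python sketch of a plain inverted index that is built once and then answers many queries without touching the disk. It is not Alkaline's "cellular expansion", whose details are not given here; the class name InMemoryIndex and its methods are assumptions made for the example.

```python
from collections import defaultdict


class InMemoryIndex:
    """Plain inverted index kept in memory for the lifetime of the process."""

    def __init__(self):
        # word -> set of page names containing that word
        self.postings = defaultdict(set)

    def add_page(self, name, text):
        """Index one page; called at startup or whenever a page changes."""
        for word in text.lower().split():
            self.postings[word].add(name)

    def search(self, query):
        """Return the pages containing every word of the query."""
        words = query.lower().split()
        if not words:
            return []
        # .get avoids creating empty entries for unknown words
        result = set(self.postings.get(words[0], ()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return sorted(result)


if __name__ == "__main__":
    # A long-running server would build this once and keep it alive.
    index = InMemoryIndex()
    index.add_page("index.html", "Alkaline search server overview")
    index.add_page("faq.html", "search scripts and CGI")
    for query in ("search", "search server", "perl"):
        print(query, "->", index.search(query))
```

A real search server would wrap such a structure in a network service and protect updates against concurrent requests; both are left out to keep the sketch short.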

Among such servers you can of course find Alkaline, but also Infoseek Ultraseek (http://www.ultraseek.com/) or Thunderstone Webinator (http://www.thunderstone.com/webinator/).

Finally, Distributed Servers target searching and indexing of the whole web. This is the fiercest long-term fight among search engines, as large companies compete for the best technology and the most relevant search results. We plan a parallel implementation of Alkaline for a cluster running over a platform-independent TCP/IP network and for the IBM SP2. We have already run numerous tests over a PVM network. For our distributed architecture, we want Alkaline to index 5-10 million pages while running fast on a cluster of 32 PII PCs. Unlike Altavista, we do not plan to set price-dependent search limits on Alkaline; that is, we will distribute it as one single product at the same price no matter what you search. Choosing Alkaline, you will also choose a team that works for the future.

Technically, distributed search servers perform both indexing and searching in parallel. The more hardware power you have, the faster indexing and searching are. Of course, this depends on the network overhead. All major search engines use distributed architectures and can handle hundreds of requests per second.
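The structure of parallel indexing and searching can be sketched as follows. This is a simplified illustration and not the planned Alkaline cluster design: pages are split into shards, each shard is indexed on its own, and every query is sent to all shards before the partial results are merged. Threads in one process stand in for the networked nodes, so the sketch shows the data flow rather than any real speedup, and the function names are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def index_shard(pages):
    """Build an inverted index (word -> set of page names) for one shard."""
    index = {}
    for name, text in pages:
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(name)
    return index


def search_shard(index, words):
    """Return the pages of one shard that contain every query word."""
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


def build_shards(pages, n_shards):
    """Split pages round-robin and index all shards in parallel."""
    items = sorted(pages.items())
    parts = [items[i::n_shards] for i in range(n_shards)]
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        return list(pool.map(index_shard, parts))


def search_all(shards, query):
    """Send the query to every shard in parallel and merge the results."""
    words = query.lower().split()
    if not words:
        return []
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = pool.map(lambda idx: search_shard(idx, words), shards)
        merged = set().union(*partial)
    return sorted(merged)


if __name__ == "__main__":
    # Hypothetical pages used only for the demonstration.
    pages = {
        "a.html": "parallel search",
        "b.html": "parallel indexing",
        "c.html": "search cluster",
        "d.html": "network overhead",
    }
    shards = build_shards(pages, 2)
    print(search_all(shards, "parallel"))
```

Because each page lives entirely in one shard, a multi-word query can be answered per shard and the results simply united. On a real cluster the merge step is where the network overhead mentioned above is paid.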