Alkaline has been used extensively with sites that are dynamically generated or contain cgi scripts. Several things should be taken into consideration when configuring Alkaline for such sites in order to avoid common problems.
Usually, cgi requests are made to scripts that have their own extension: for example, somecgi.exe has a .exe extension and somecgi.pl has a .pl extension. These extensions must be added to the configuration with ExtsAdd=exe or ExtsAdd=pl, or both with ExtsAdd=pl,exe.
By definition, a cgi request is identified by a trailing list of parameters after the ? character, for example /cgi-bin/foo.pl?name=value. With a default configuration, Alkaline ignores such requests. To have them indexed, you must add Cgi=Y to the configuration file.
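As a rough illustration of the two rules above, the following Python sketch inspects a url and lists the options it would call for. It is not part of Alkaline: the function name required_options and the DEFAULT_EXTS set (the extensions assumed to be indexed without extra configuration) are assumptions made for the example.

```python
import posixpath
from urllib.parse import urlsplit

# Assumption: extensions indexed without extra configuration (illustrative only).
DEFAULT_EXTS = {"", "html", "htm"}

def required_options(url):
    """List the configuration lines a url of this shape would call for."""
    parts = urlsplit(url)
    options = []
    # Extension of the script or document, without the dot, lowercased.
    ext = posixpath.splitext(parts.path)[1].lstrip(".").lower()
    if ext not in DEFAULT_EXTS:
        options.append("ExtsAdd=" + ext)
    # Parameters after the ? character mark a cgi request.
    if parts.query:
        options.append("Cgi=Y")
    return options

print(required_options("/cgi-bin/foo.pl?name=value"))
```

Running it over a handful of representative urls from a site gives a quick checklist of what to add before the first indexing run.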
Dynamically generated pages, except for Active Server Pages (.asp), often do not output a Content-Length field and never honor If-Modified-Since headers. It might be judicious to add Expire=Y (or run Alkaline with -expire) in order to index all pages and avoid Alkaline spending time determining the last modified date of each document.
The cgi parameters are often case-insensitive, especially for .exe scripts under Windows NT. You might want to treat urls as case-insensitive. To do so, add Insens=Y to the configuration file.
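To see why case matters to a crawler, consider the minimal Python sketch below. It only models the idea of comparing urls with or without regard to case and is not Alkaline's code; the function name unique_links and the sample urls are made up for the example.

```python
def unique_links(links, insensitive=False):
    """Return links in their original order, dropping those already seen."""
    seen = set()
    unique = []
    for link in links:
        # With insensitive=True, urls that differ only by case share one key.
        key = link.lower() if insensitive else link
        if key not in seen:
            seen.add(key)
            unique.append(link)
    return unique

links = ["/cgi-bin/Search.exe?Topic=News", "/cgi-bin/search.exe?topic=news"]
print(unique_links(links))                    # both urls are kept
print(unique_links(links, insensitive=True))  # only the first url is kept
```

Without case-insensitive comparison the same script output would be fetched and indexed once per spelling of the url.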
A lot of cgi scripts generate dynamic data as the result of an html form post. Alkaline can neither simulate a post method nor fill in a form for the user. You might want to add some links manually or by using an <alkaline url=...> tag.
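Taken together, the recommendations above might translate into a configuration along these lines. This is a sketch only: the UrlList value is a placeholder, and the extension list assumes that the site uses .pl and .exe scripts.

```
UrlList=http://www.foo.com/
ExtsAdd=pl,exe
Cgi=Y
Expire=Y
Insens=Y
```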
Background indexing is one of the most powerful Alkaline features. It allows Alkaline to continuously index a site and make changes available to the search engine instantly. Some points should nevertheless be considered very seriously, especially for heavy traffic sites and sites indexing large amounts of data.
Background indexing is disabled by setting Reindex=N in all configuration files or by running Alkaline with the --noreindex option. The latter is the preferred way, as no background thread is created when --noreindex is specified on the command line.
A heavy traffic search site would receive more than one request every two seconds. Alkaline is known to handle 3-5 requests a second, depending on the hardware configuration.
Everything in Alkaline is done to favor search speed over background indexing. This includes regular checks of search activity from the background thread in order to pause the latter as soon as possible. Still, the background indexing thread and the search thread manipulate the same data, and each will lock the other out depending on its access needs.
Enabling background indexing implies at least 10-25% more memory usage and much higher CPU activity, reaching 70% average CPU usage compared to 3-5% without the background indexing thread.
Because of the interlocked architecture of Alkaline, enabling background indexing means degrading search performance. Search performance will degrade by at least 25% during normal background reindexing. Search operations will degrade by up to 95% while indexes are being written, so background indexing is not advised on sites with over 50,000 documents.
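The figures quoted in this section can be combined into a rough capacity estimate. The Python sketch below is plain arithmetic, not a measurement: it assumes that a degradation of N% means that throughput drops by N%, and the name degraded_rate is illustrative.

```python
def degraded_rate(requests_per_second, degradation_percent):
    """Throughput left after losing the given percentage of performance."""
    return requests_per_second * (100 - degradation_percent) / 100

HEAVY_TRAFFIC = 0.5  # one request every two seconds

for label, loss in (("idle", 0), ("background reindexing", 25), ("writing indexes", 95)):
    # 3 and 5 requests a second are the bounds quoted for Alkaline.
    low, high = degraded_rate(3, loss), degraded_rate(5, loss)
    verdict = "keeps up" if low >= HEAVY_TRAFFIC else "may fall behind"
    print(f"{label}: {low:.2f}-{high:.2f} requests/s, {verdict} with heavy traffic")
```

Under these assumptions a 25% loss still leaves comfortable headroom, while a 95% loss brings throughput below the heavy traffic rate for as long as indexes are being written.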
If you enable background indexing on a huge site, be sure to use the SleepFile and SleepRoundtrip parameters, as they can dramatically increase the performance of the search front-end.
Alkaline is capable of serving thousands of requests per minute, that is, many requests a second. Since Alkaline is a full HTTP server and pools requests, it is very unlikely that responsiveness to the clients would be an issue. Disabling background indexing is the first thing to do when speed becomes a concern or when you are having trouble keeping the engine stable.
Alkaline fully supports indexing and searching of Lotus Notes Domino sites. Indexing Lotus Notes Domino generated sites requires several additional options to be added to the configuration files.
Domino requests might look like a simple request for a .nsf file. Adding AddExts=nsf to the configuration files tells Alkaline to index documents with the .nsf extension. All Domino requests look like cgi requests, so it is also necessary to add Cgi=Y.
Moreover, Domino can generate multiple views for the same content. Alkaline includes an expansion feature that will transform any page with collapsed elements into a request for its expanded form. To enable this, add Nsf=Y to the configuration. The Nsf option will also enable lookup of full duplicates, that is, pages that have different urls but the same content. Domino generates urls in all possible forms and shapes, often leading to the same content.
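The notion of a full duplicate can be illustrated with a short Python sketch that groups urls by a digest of their content. This is an assumption about how such a lookup could work, not a description of Alkaline's actual algorithm; the function name group_duplicates and the sample urls are invented for the example.

```python
import hashlib
from collections import defaultdict

def group_duplicates(pages):
    """Group urls whose page bodies are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        by_digest[digest].append(url)
    # Keep only the groups that contain more than one url.
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]

pages = {
    "/news.nsf/all/abc?OpenDocument": "<html>Annual report</html>",
    "/news.nsf/bydate/abc?OpenDocument": "<html>Annual report</html>",
    "/news.nsf/all/def?OpenDocument": "<html>Press release</html>",
}
print(group_duplicates(pages))
```

Only one url from each group needs to appear in the index; the others add nothing to search results.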
Domino is not case-sensitive for urls, but UNIX servers are, and by default Alkaline treats /Foo and /foo as two different links. Insens=Y must therefore be added. Domino might also generate empty links leading to pages with junk (which is probably a bug). Such a link looks like <a href=url> and is neither clickable nor has any text. Alkaline will still follow such links unless you specify EmptyLinks=N.
Here's an example of a configuration file for the Geneva Hospital Domino web server:
Remote=N
UrlList=http://www.hug-ge.ch
Depth=-1
MaxFiles=-1
SleepRoundtrip=21600
CGI=Y
AddExts=nsf
Insens=Y
NSF=Y
NoEmptyLinks=Y
It is possible to combine Alkaline filter features and the Alkaline spider to mirror an entire set of web sites.
Create the following url2path Perl script and ensure that you can run it on your server. This script transforms a url into a fully qualified path and copies a file into the newly created directory hierarchy. This script has been tested on a Linux server.
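#!/usr/bin/perl
# url2path: copy a downloaded document into a directory tree that mirrors its url
use strict;
use warnings;
use Carp;
use URI;
use File::Copy qw(copy);
use File::Path qw(make_path);

@ARGV == 3 or Carp::croak 'usage: url2path [url] [source] [target]';
my ($address, $source, $prefix) = @ARGV;
my $url = URI->new($address);
my @path = $url->path_segments;
# the last segment is the file name; it is empty when the url ends with a slash
my $file = pop @path;
$file = 'index.html' unless defined $file && length $file;
# build target/host/dir1/dir2/, skipping empty segments such as the leading one
my $fullpath = $prefix . join('/', $url->host, grep { length } @path) . '/';
my $fullfile = $fullpath . $file;
print "url2path: saving $fullfile\n";
# create the directories and copy without going through a shell
make_path($fullpath);
copy($source, $fullfile) or Carp::croak "url2path: cannot copy $source to $fullfile: $!";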
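To check which path a given url maps to without touching the file system, the mapping can be reproduced in a few lines of Python. This sketch follows the intent of the script above (host name, then directories, then file name, with index.html for urls ending in a slash); the function name url_to_path is an assumption, and unlike the Perl script it copies nothing.

```python
from urllib.parse import urlsplit

def url_to_path(url, prefix):
    """Return the mirror path for a url under the given target prefix."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # The last segment is the file name; fall back to index.html when empty.
    filename = segments.pop() or "index.html"
    # Host name first, then every non-empty directory segment.
    directories = [parts.hostname or ""] + [s for s in segments if s]
    return prefix + "/".join(directories) + "/" + filename

print(url_to_path("http://www.foo.com/images/logo.gif", "/home/mirrors/"))
print(url_to_path("http://www.foo.com/docs/", "/home/mirrors/"))
```

As in the Perl script, the target prefix is expected to end with a slash.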
Use a regular asearch.cnf to invoke the script for each newly downloaded document using the Filter directive, similar to this one:
UrlList=http://www.foo.com/
SkipText=Y
SkipMeta=Y
Filter=/usr/bin/url2path.pl $3 $1 /home/mirrors/ ; mv $1 $2
ExtsAdd=jpg,gif
Alkaline supports indexing of local file system files since version 1.7. To index a particular directory and all its subdirectories, use the file:// format for your urls. For example:
# UNIX asearch.cnf
UrlList=file:///home/user/

# Windows asearch.cnf
UrlList=file://d:\documents\web\
Alkaline will treat a directory listing as a document and index it according to the normal rules. All configuration options valid for http:// urls fully apply to file:// urls, including the extensions to be indexed and the links to follow.
Search results are returned in the file:// url format as well. To index a web site that is stored locally, use the ReplaceLocal directive and an http:// format url. To index web content whose documents are not linked to each other, use the file:// url format and the Replace directive to render search results.
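The effect of such a replacement on a search result can be pictured as a simple prefix substitution. The Python sketch below only illustrates the idea: it does not show the syntax of the Replace or ReplaceLocal directives, and the function name replace_prefix and both prefixes are assumptions.

```python
def replace_prefix(url, local_prefix, public_prefix):
    """Swap a local file:// prefix for the public prefix shown to users."""
    if url.startswith(local_prefix):
        return public_prefix + url[len(local_prefix):]
    # Urls outside the local tree are returned unchanged.
    return url

print(replace_prefix("file:///home/user/docs/a.html",
                     "file:///home/user/", "http://www.foo.com/"))
```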