Chapter 7. Advanced Alkaline Features

Table of Contents
Indexing Other Document Formats
Available Filters
Alkaline Robots, HTML and Meta Tags
Online Administration and Statistics
Running Alkaline as a Windows NT/2000 Service
Alkaline Virtual Memory and Swap
Indexing Guidelines
Working With Us

Indexing Other Document Formats

Introduction

Alkaline can index special document formats, such as Adobe PDF or Microsoft Word. Indexing any document format other than HTML requires a filter.

Alkaline can preprocess any retrieved data before indexing it. A document of any format can be passed to an external program, called a filter, which transforms it; Alkaline then indexes the result. The most direct use of filters is indexing a site that serves documents in formats other than HTML. Because a filter can be invoked on any indexed document, filters can also be used to implement features such as mirroring sites or gathering site statistics and information.

Both document and object filters read the contents of a temporary file created by Alkaline and write HTML output to a second temporary file, which Alkaline then reads.

It is necessary to instruct Alkaline to retrieve a document of a different type. This is done by using the ExtsAdd directive in the asearch.cnf file. For example:
ExtsAdd=pdf,doc

There are two sorts of filters in Alkaline: document filters and object filters. Document filters process documents directly linked from HTML pages. Object filters process embedded objects.
Simple PDF document:
<a href="docs/document.pdf">pdf document</a>

Embedded Shockwave Flash object:
<object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000"
 codeBase="http://active.macromedia.com/flash2/cabs/swflash.cab#version=3,0,0,0"
 height="100%" id="navig" width="100%">
  <param name="movie" value="navig.swf">
  <param name="loop" value="false">
  <param name="quality" value="autohigh">
  <param name="menu" value="false">
</object>

Document Filters

Document filters are defined in the asearch.cnf file with:
Filter Extension/Mime Type=command line

It is possible to preprocess all documents by omitting the Extension/Mime Type parameter, for example:
Filter=/bin/filter $1 $2

Some documents are returned without a MIME type or have no extension. For example, http://www.server.com/ does not imply any extension, and the server might be HTTP/0.9 compliant, returning HTML content without the Content-type header. Specifying a filter with no type catches all of these special cases.

To index PDF documents, you must tell Alkaline to retrieve PDF files by adding ExtsAdd=pdf to the asearch.cnf file. An Adobe PDF filter would be declared like this (note that the extension is case-sensitive, so both spellings are listed):
Filter PDF=/bin/pdftotext $1 $2
Filter pdf=/bin/pdftotext $1 $2

Listing every case variant of an extension is not very convenient; you can also specify MIME types, for example:
Filter Application/Zip=/bin/specialunzip $1 $2
Filter Application/Pdf=/bin/pdftotext $1 $2

Variables such as $1 and $2 pass parameters to the filter. The available variables are:

Table 7-1. Document Filter Automatic Variables
Variable  Description
$0        filename extension without the leading dot (ex: pdf)
$1        temporary file name that contains data remotely retrieved
$2        temporary file name that should contain results generated by the filter
$3        url of the file retrieved, not quoted
$4        data retrieved (use a temporary file, $1, instead)
$5        mime type of the file retrieved if any (such as application/zip), not quoted
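The variables in Table 7-1 can be combined to gather site statistics, as mentioned in the introduction. The following Python sketch is one possible approach and is not part of Alkaline: it copies the retrieved document through unchanged and appends the URL, MIME type and size to a CSV log. The script name stats-filter, the log file name and the configuration line shown in the comment are assumptions made for illustration. Since $3 and $5 are passed unquoted, the sketch assumes they can be quoted in the command line, and it tolerates a missing MIME type.

```python
#!/usr/bin/env python3
"""Pass-through document filter that logs one CSV row per retrieved document.

Hypothetical asearch.cnf entry (script path and quoting are assumptions):
    Filter=/usr/local/bin/stats-filter $1 $2 "$3" "$5"
"""
import csv
import os
import shutil
import sys

LOG_PATH = "alkaline-stats.csv"  # hypothetical log location


def record_and_copy(source, target, url="", mime="", log_path=LOG_PATH):
    """Copy source ($1) to target ($2) unchanged and log url, mime type, size."""
    shutil.copyfile(source, target)
    with open(log_path, "a", newline="") as log:
        csv.writer(log).writerow([url, mime, os.path.getsize(source)])


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("usage: stats-filter SOURCE TARGET [URL] [MIME]", file=sys.stderr)
    else:
        # $3 and $5 are optional here because a document may have no mime type.
        record_and_copy(*sys.argv[1:5])
```

Because the target file is an exact copy of the source, the document is indexed as if no filter had been applied.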

Object Filters

Alkaline will process objects of the following format found in the retrieved HTML documents:
<object classid="clsid:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx">
 <param name="name1" value="value1">
 <param name="name2" value="value2">
 ...
</object>
Such objects are embedded in the document. Once the filter has processed the object, the filter's output is embedded into the document as well.

To make a filter work properly for an embedded object, the following must be included in the asearch.cnf file:
# command line to execute for each object of type ClassID
Object ClassID=command line
# param value to use to retrieve the document
ObjectDocument ClassID=parameter name

Here is a real example of filtering embedded Shockwave Flash objects. The Shockwave Flash CLSID (unique class ID) is clsid:D27CDB6E-AE6D-11cf-96B8-444553540000 and the embedded object at http://www.foo.com/bar/ looks like this:
<object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000"
 codeBase="http://active.macromedia.com/flash2/cabs/swflash.cab#version=3,0,0,0"
 height="100%" id="navig" width="100%">
  <param name="movie" value="foo.swf">
  <param name="loop" value="false">
  <param name="quality" value="autohigh">
  <param name="menu" value="false">
</object>

The document for the Shockwave Flash object is named by the movie parameter, so the following should be added to asearch.cnf:
ObjectDocument clsid:D27CDB6E-AE6D-11cf-96B8-444553540000=movie

This will tell Alkaline to retrieve http://www.foo.com/bar/foo.swf as defined by the movie variable in the object with the CLSID "clsid:D27CDB6E-AE6D-11cf-96B8-444553540000".
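The URL above is what standard relative URL resolution produces from the page address and the movie value. The short Python sketch below reproduces that step with the standard library. The function name object_document_url is hypothetical, and the use of a BASE HREF when one is present is an assumption about Alkaline's behaviour, modelled on how browsers resolve relative links.

```python
from urllib.parse import urljoin


def object_document_url(page_url, param_value, base_href=None):
    """Resolve the ObjectDocument parameter value against the page (or its BASE HREF)."""
    return urljoin(base_href or page_url, param_value)


print(object_document_url("http://www.foo.com/bar/", "foo.swf"))
# http://www.foo.com/bar/foo.swf
```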

For the filter to be executed, a valid command line must be defined. Variable mapping for object command lines is more complete than for document filters: all variables defined by the param tags are available, in addition to the following:

Table 7-2. Object Filter Automatic Variables
Variable    Description
sourcefile  filename that contains the retrieved document
targetfile  filename that should contain the filter results
url         URL where the OBJECT tag was found
base        the BASE HREF of the document where the OBJECT tag was found

For example:
# place on one single line
Object clsid:D27CDB6E-AE6D-11cf-96B8-444553540000=/usr/local/bin/swf-filter
 $sourcefile --menu="$menu" > $targetfile
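A program invoked this way receives the source file name as a positional argument and the menu parameter as an option, and writes its result to standard output, which the command line redirects to $targetfile. The Python sketch below is a stand-in with the same calling convention, not the swf-filter program itself. It does not parse the Flash format: it only extracts runs of printable ASCII, similar to the Unix strings tool. The META tag carrying the menu value is purely illustrative.

```python
#!/usr/bin/env python3
"""Crude object filter: SOURCEFILE [--menu VALUE], HTML written to stdout."""
import argparse
import html
import re
import sys

# Runs of four or more printable ASCII bytes, similar to the Unix `strings` tool.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_text(data):
    """Return the printable ASCII runs found in binary data."""
    return [run.decode("ascii") for run in PRINTABLE_RUN.findall(data)]


def render(strings, menu="false"):
    """Build partial HTML; the meta name below is hypothetical."""
    lines = ['<meta name="object-menu" content="%s">' % html.escape(menu, quote=True)]
    lines.extend("<p>%s</p>" % html.escape(text) for text in strings)
    return "\n".join(lines) + "\n"


def main(argv):
    parser = argparse.ArgumentParser(description="Crude text extractor for embedded objects")
    parser.add_argument("sourcefile", help="file holding the retrieved object ($sourcefile)")
    parser.add_argument("--menu", default="false", help="value of the menu param tag")
    args = parser.parse_args(argv)
    with open(args.sourcefile, "rb") as source:
        data = source.read()
    sys.stdout.write(render(extract_text(data), args.menu))
    return 0


if __name__ == "__main__":
    if len(sys.argv) > 1:
        sys.exit(main(sys.argv[1:]))
    print("usage: swf-filter SOURCEFILE [--menu VALUE]", file=sys.stderr)
```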

Alkaline produces output like the following when an object filter is invoked:
[http://www.foo.com/bar/] (-1) - [639 bytes][0]
 [clsid:D27CDB6E-AE6D-11cf-96B8-444553540000]
 [http://www.foo.com/bar/foo.swf][200 OK][510879 bytes]
 [inf][lnx][md5][vix][keys][mta][ndx][ok]

Writing Filters

An Alkaline filter is a simple command-line program that takes at least two arguments, in any order or format: the file name of the original document and the file name of the output. The output should be text or (partial) HTML. Your filter can generate TITLE and META tags. Alkaline ignores punctuation and formatting in the filter's output.
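As an illustration of the output side, the Python sketch below builds an HTML document with TITLE and META tags from text a filter has already extracted. The function name render_document is hypothetical, and the description and keywords META names are the conventional HTML ones; which META tags Alkaline actually uses is not stated here, so treat them as assumptions.

```python
import html


def render_document(title, body_text, keywords=(), description=""):
    """Wrap extracted text in HTML with TITLE and optional META tags."""
    parts = ["<html><head>", "<title>%s</title>" % html.escape(title)]
    if description:
        parts.append('<meta name="description" content="%s">'
                     % html.escape(description, quote=True))
    if keywords:
        parts.append('<meta name="keywords" content="%s">'
                     % html.escape(", ".join(keywords), quote=True))
    parts.append("</head><body>")
    # Blank lines separate paragraphs in the extracted text.
    parts.extend("<p>%s</p>" % html.escape(paragraph.strip())
                 for paragraph in body_text.split("\n\n") if paragraph.strip())
    parts.append("</body></html>")
    return "\n".join(parts) + "\n"
```

A filter would write the returned string to its output file. Escaping the text matters because characters such as < and & in the source document would otherwise be read as markup.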

With Alkaline, you can specify a chain of filters or any kind of command processing for your filter files. You can also write a script that chooses whether or not to process a file. When a filter should not translate a file, it should simply output an exact copy of the original document.
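A script that chooses whether to process a file could look like the following Python sketch. It is only an illustration under stated assumptions: it treats a file as PDF when it starts with the %PDF- signature, hands such files to pdftotext (assumed to be on the PATH), and copies everything else through unchanged. The function names and the script name are hypothetical.

```python
#!/usr/bin/env python3
"""Selective filter: convert PDF files, copy anything else unchanged."""
import shutil
import subprocess
import sys

PDF_MAGIC = b"%PDF-"  # PDF files begin with this signature


def looks_like_pdf(path):
    """Return True when the file starts with the PDF signature."""
    with open(path, "rb") as source:
        return source.read(len(PDF_MAGIC)) == PDF_MAGIC


def run_filter(source, target, converter=("pdftotext",)):
    """Convert PDF input with the external converter, otherwise copy it."""
    if looks_like_pdf(source):
        # Same calling convention as the Filter lines: converter SOURCE TARGET
        subprocess.run([*converter, source, target], check=True)
        return "converted"
    shutil.copyfile(source, target)  # exact copy of the original document
    return "copied"


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: selective-filter SOURCE TARGET", file=sys.stderr)
    else:
        run_filter(sys.argv[1], sys.argv[2])
```

Checking the file signature rather than the extension also covers documents that are served without an extension or MIME type.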

If you plan to write a new filter, make sure you visit http://www.wotsit.org. It is an excellent source of document format specifications and related technical resources. When you test or write a new filter, please email admin@vestris.com with a detailed description, examples, source and/or binary availability, licensing information, and any other useful links and comments.