Alkaline features include indexing special document formats, such as Adobe PDF or Microsoft Word.
To index document formats other than HTML, a filter
is required.
Alkaline has the ability to preprocess any data retrieved before it is indexed.
A document of any format can be passed to any external piece of software,
called a filter, transformed by this filter, and then indexed. Filters let Alkaline perform several tasks.
The obvious one is indexing a site that contains documents in formats other than HTML. But because a filter can be invoked on any indexed
document, filters can also be used to implement features such as mirroring sites or gathering site statistics and information.
Both document and object filters read a temporary file created by Alkaline, process its contents,
and write HTML output to a second temporary file, which Alkaline then reads.
Alkaline must also be instructed to retrieve documents of a different type.
This is done with the ExtsAdd directive in the asearch.cnf file.
For example, ExtsAdd=pdf tells Alkaline to retrieve pdf files.
There are two sorts of filters in Alkaline: document filters and object filters.
Document filters process documents directly linked from HTML pages. Object filters process embedded objects.
Simple PDF document:
<a href="docs/document.pdf">pdf document</a>
Embedded Shockwave Flash object:
<object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000"
codeBase="http://active.macromedia.com/flash2/cabs/swflash.cab#version=3,0,0,0"
height="100%" id="navig" width="100%">
<param name="movie" value="navig.swf">
<param name="loop" value="false">
<param name="quality" value="autohigh">
<param name="menu" value="false">
</object>
Document filters are defined in the asearch.cnf file by:
Filter Extension/Mime Type=command line
It is possible to preprocess all documents by omitting the Extension/Mime Type parameter, for example:
Some documents are returned without a mime type or have no extension, for example http://www.server.com/
does not imply any extension and might be an HTTP/0.9 compliant server returning HTML contents without
the Content-type header. Specifying a filter with no type will catch all these special cases.
To index pdf documents, you must tell Alkaline to retrieve pdf files by adding ExtsAdd=pdf to the asearch.cnf file.
An Adobe pdf filter would be used like this (notice that the extension is case-sensitive):
Filter PDF=/bin/pdftotext $1 $2
Filter pdf=/bin/pdftotext $1 $2
Specifying a case-sensitive extension is not very convenient; you can also specify mime types, for example:
Filter Application/Zip=/bin/specialunzip $1 $2
Filter Application/Pdf=/bin/pdftotext $1 $2
The variables such as $1, $2 are used to pass parameters to the filter.
Available variables are:
Table 7-1. Document Filter Automatic Variables
| $0 | filename extension without the leading dot (ex: pdf) |
| $1 | temporary file name that contains data remotely retrieved |
| $2 | temporary file name that should contain results generated by the filter |
| $3 | url of the file retrieved, not quoted |
| $4 | data retrieved (use a temporary file, $1, instead) |
| $5 | mime type of the file retrieved if any (such as application/zip), not quoted |
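To make the table concrete, the following Python sketch shows what a command template such as the pdftotext line above might look like once the variables are filled in. It only illustrates the mapping: the function name expand_command and the temporary file names are assumptions made for this example, and Alkaline's actual substitution and quoting rules may differ.

```python
import re

def expand_command(template, values):
    """Replace $0..$5 in a filter command template with concrete values.

    Illustration only: this is not Alkaline's code. Variables that have
    no value in the mapping are left untouched.
    """
    def substitute(match):
        key = match.group(0)
        return values.get(key, key)
    return re.sub(r"\$[0-5]", substitute, template)

# Hypothetical values for one retrieved document.
values = {
    "$0": "pdf",
    "$1": "/tmp/alk-in.tmp",
    "$2": "/tmp/alk-out.tmp",
    "$3": "http://www.server.com/docs/document.pdf",
    "$5": "application/pdf",
}

command = expand_command("/bin/pdftotext $1 $2", values)
print(command)
```

Because $3 and $5 are passed unquoted, a command line that uses them would probably need its own quoting around those variables.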
Alkaline will process objects of the following format found in the retrieved HTML documents:
<object classid="clsid:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx">
<param name="name1" value="value1">
<param name="name2" value="value2">
...
</object>
Such objects are embedded in the document.
After the object filter has run, its output is embedded into the document as well.
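As a rough illustration of the information such an object carries, the Python sketch below collects the classid and the param name/value pairs from an HTML fragment using the standard library parser. The class name ObjectCollector is invented for this example, and the sketch makes no claim about how Alkaline itself parses pages.

```python
from html.parser import HTMLParser

class ObjectCollector(HTMLParser):
    """Collect the classid and <param> name/value pairs of each <object>."""

    def __init__(self):
        super().__init__()
        self.objects = []      # finished objects, in document order
        self._current = None   # object currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "object":
            self._current = {"classid": attrs.get("classid"), "params": {}}
        elif tag == "param" and self._current is not None:
            self._current["params"][attrs.get("name")] = attrs.get("value")

    def handle_endtag(self, tag):
        if tag == "object" and self._current is not None:
            self.objects.append(self._current)
            self._current = None

page = """
<object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000">
<param name="movie" value="navig.swf">
<param name="menu" value="false">
</object>
"""

collector = ObjectCollector()
collector.feed(page)
print(collector.objects)
```

The classid identifies which filter applies, and the params are the values a filter command line can draw on.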
To make a filter work properly for an embedded object, the following must be included into the asearch.cnf file:
# command line to execute for each object of type ClassID
Object ClassID=command line
# param value to use to retrieve the document
ObjectDocument ClassID=parameter name
Here is a real example for filtering Shockwave Flash embedded objects.
The shockwave flash CLSID (unique class ID) is clsid:D27CDB6E-AE6D-11cf-96B8-444553540000
and the embedded object at http://www.foo.com/bar/ looks like this:
<object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000"
codeBase="http://active.macromedia.com/flash2/cabs/swflash.cab#version=3,0,0,0"
height="100%" id="navig" width="100%">
<param name="movie" value="foo.swf">
<param name="loop" value="false">
<param name="quality" value="autohigh">
<param name="menu" value="false">
</object>
The document for the Shockwave Flash object is named by the movie parameter,
so the following should be added to asearch.cnf:
ObjectDocument clsid:D27CDB6E-AE6D-11cf-96B8-444553540000=movie
This will tell Alkaline to retrieve http://www.foo.com/bar/foo.swf as defined by the movie variable
in the object with the CLSID "clsid:D27CDB6E-AE6D-11cf-96B8-444553540000".
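The resulting URL is what standard relative-URL resolution gives for the movie value against the page address. The short Python sketch below reproduces it with the standard library; it assumes ordinary resolution rules and is not taken from Alkaline's source.

```python
from urllib.parse import urljoin

page_url = "http://www.foo.com/bar/"   # page containing the OBJECT tag
movie = "foo.swf"                      # value of the "movie" param

object_url = urljoin(page_url, movie)
print(object_url)  # http://www.foo.com/bar/foo.swf
```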
For the filter to be executed, a valid command line must also be defined.
Variable mapping for object command lines is more complete than for document filters:
all variables defined by the param tags are available, in addition to the automatic variables listed in Table 7-2.
Table 7-2. Object Filter Automatic Variables
| sourcefile | filename that contains the retrieved document |
| targetfile | filename that should contain the filter results |
| url | URL where the OBJECT tag was found |
| base | the BASE HREF of the document where the OBJECT tag was found |
For example:
# place on one single line
Object clsid:D27CDB6E-AE6D-11cf-96B8-444553540000=/usr/local/bin/swf-filter
$sourcefile --menu="$menu" > $targetfile
Alkaline produces output like the following when an object filter is invoked:
[http://www.foo.com/bar/] (-1) - [639 bytes][0]
[clsid:D27CDB6E-AE6D-11cf-96B8-444553540000]
[http://www.foo.com/bar/foo.swf][200 OK][510879 bytes]
[inf][lnx][md5][vix][keys][mta][ndx][ok]
An Alkaline filter is a simple command line program that takes at least two arguments, in any order or format: the file
name of the original document and the file name of the output. The output should be text or (partial) HTML. Your filter
can generate TITLE and META tags. Punctuation and formatting output by the filter are ignored by Alkaline.
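A minimal filter along these lines could look like the following Python sketch. It assumes the input is plain text and that the filter is configured with a command line ending in $1 $2; the script name, the choice of the first non-empty line as the title, and the UTF-8 encoding are all assumptions made for the example.

```python
import html
import sys

def text_to_html(source_path, target_path):
    """Wrap a plain-text file in minimal HTML with a TITLE tag."""
    with open(source_path, "r", encoding="utf-8", errors="replace") as src:
        text = src.read()
    lines = text.splitlines()
    # Use the first non-empty line as the title (an arbitrary choice).
    title = next((line.strip() for line in lines if line.strip()), "Untitled")
    with open(target_path, "w", encoding="utf-8") as dst:
        dst.write("<html><head><title>%s</title></head>\n" % html.escape(title))
        dst.write("<body><pre>%s</pre></body></html>\n" % html.escape(text))

if __name__ == "__main__" and len(sys.argv) == 3:
    # Invoked as: filter.py $1 $2
    text_to_html(sys.argv[1], sys.argv[2])
```

It could then be registered with a line such as Filter txt=/usr/local/bin/filter.py $1 $2, where the path is hypothetical.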
With Alkaline, you can specify a chain of filters or any kind of command processing for your filter files.
You can also write a script that decides whether or not to process a file. When the filter(s) should
not translate the file, simply output an exact copy of the original document.
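One way to write such a script is sketched below in Python. It assumes that PDF files should be converted with the /bin/pdftotext command used earlier and that everything else should be passed through unchanged; the function names and the check on the leading %PDF- bytes are choices made for this example, not Alkaline requirements.

```python
import shutil
import subprocess
import sys

def is_pdf(path):
    """Return True if the file starts with the PDF magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"

def filter_or_copy(source_path, target_path, converter=("/bin/pdftotext",)):
    """Convert PDFs with an external tool; copy everything else unchanged."""
    if is_pdf(source_path):
        # The converter is called as: converter... source target
        subprocess.run([*converter, source_path, target_path], check=True)
        return "converted"
    # Not a PDF: output an exact copy of the original document.
    shutil.copyfile(source_path, target_path)
    return "copied"

if __name__ == "__main__" and len(sys.argv) == 3:
    # Invoked as: script $1 $2
    filter_or_copy(sys.argv[1], sys.argv[2])
```

Checking the file contents rather than the extension also covers documents served without an extension or mime type.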
If you plan to write a new filter, visit http://www.wotsit.org.
It is an excellent source of document format specifications and related technical resources.
As you test or write a new filter, please email admin@vestris.com with a detailed description, examples,
source and/or binary availability, licensing information, and all other useful links and comments.