Available Filters

Adobe Pdf (pdf2text and pdf2html)

The pdf2text Adobe PDF filter has been successfully tested. It is built on pdftotext from the xpdf tool, written by Derek B. Noonburg <derekn@foolabs.com> and distributed under the GPL license. Download xpdf, which contains pdftotext and pdfinfo, from the FooLabs site at http://www.foolabs.com/xpdf/ .

The pdftotext program accepts two parameters: a source and a target filename. Thus the line to add to your asearch.cnf for the pdf filter looks like this:
Filter pdf=/bin/pdftotext $1 $2
You can suppress filter errors by appending > nul on Windows NT or > /dev/null 2>&1 on Unix to the command line:
Filter pdf=/bin/pdftotext $1 $2 > /dev/null 2>&1

You can use pdftotext along with pdfinfo in order to generate HTML content and benefit from meta keywords, author and document title. Derek's pdftohtml shell script for UNIX implements those features and is available at http://alkaline.vestris.com/filters/pdftohtml .
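
The combination of pdftotext and pdfinfo described above can be sketched in Python. This is not Derek's pdftohtml script, only a minimal illustration. It assumes that pdfinfo prints "Key: value" lines (such as Title, Author and Keywords), that pdftotext writes to standard output when the target is "-", and that both programs are on the PATH. The function names info_to_html and pdf_to_html are hypothetical.

```python
import html
import subprocess


def info_to_html(info_text, body_text):
    """Build a small HTML page from pdfinfo-style output and extracted text."""
    title = ""
    metas = []
    for line in info_text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or not key or not value:
            continue  # skip lines that are not "Key: value" pairs
        if key == "Title":
            title = value
        elif key in ("Author", "Keywords", "Subject"):
            # Only these keys become meta tags in this sketch (an assumption).
            metas.append('<meta name="%s" content="%s">'
                         % (key.lower(), html.escape(value, quote=True)))
    head = ["<title>%s</title>" % html.escape(title, quote=False)] + metas
    return ("<html><head>\n" + "\n".join(head) + "\n</head><body><pre>\n"
            + html.escape(body_text, quote=False) + "\n</pre></body></html>\n")


def pdf_to_html(source, target):
    """Run pdfinfo and pdftotext on source and write the HTML page to target."""
    info = subprocess.run(["pdfinfo", source], capture_output=True,
                          text=True, check=True).stdout
    body = subprocess.run(["pdftotext", source, "-"], capture_output=True,
                          text=True, check=True).stdout
    with open(target, "w", encoding="utf-8") as out:
        out.write(info_to_html(info, body))
```

A script that calls pdf_to_html with its two command-line arguments could then be registered with a Filter directive in the same way as pdftotext, since it follows the same source and target convention.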

You can pick up executables of pdf2text and gzip, along with a pdf2html script, for Windows NT in the Alkaline distribution directory at http://alkaline.vestris.com/download/WinNT/pdf2text.zip . The asearch.cnf configuration directives must use fully qualified Windows paths:
Filter pdf=c:\tools\pdftotext.exe $1 $2
or, if you are using pdf2html:
Filter pdf=c:\tools\pdf2html.bat $1 $2

Microsoft Word (wvHtml)

The wvWare Microsoft Word filter has been successfully tested. Formerly known as MsWordView, it is provided by Caolán McNamara <caolan.mcnamara@ul.ie> and can be found at http://www.wvware.com/ under the GPL license. wvWare can load and parse the Word 2000, 97, 95 and 6 file formats.

Because wvHtml writes to standard output, the filter redirects it to the target file:
Filter doc=/bin/wvHtml $1 > $2

Microsoft Rich Text Format (rtf2html)

Rtf2Html is commercial software from Chris Hector <chris@sunpack.com> and is available at http://www.sunpack.com/RTF/ . This filter has not been tested.

LaTeX / TeX (LaTeX2HTML)

TeX is a typesetting system written by Donald E. Knuth. LaTeX is a TeX macro package, originally written by Leslie Lamport, that provides a document processing system. A great TeX FAQ can be found at http://www.tex.ac.uk/cgi-bin/texfaq2html .

A LaTeX2HTML converter is available from Nikos Drakos <nikos@mpn.com> of the University of Leeds at http://cbl.leeds.ac.uk/nikos/tex2html/doc/latex2html/latex2html.html . This filter has not been tested.

Word Perfect, AmiPro, Wang WPS (Plus), etc.

WebConvert, a commercial program for Windows available from http://www.webconvert.com/ , promises to convert most of the major document formats. It can be used from the command line, so it can run as an Alkaline filter. It has not been tested.

Shockwave Flash

You can grab the source code in C of a tested Shockwave Flash decoder at http://alkaline.vestris.com/filters/swfparse.cpp . A volunteer is welcome to transform this into a real filter that would extract text and links.

Extensible Markup Language (Xml)

Xml documents are structured differently from HTML and need some processing before they can be indexed usefully. This can be done with the freeware tool Xml2, by Dan Egnor, at http://ofb.net/~egnor/xml2/ . Also check out Pixie, by Sean McGrath, at http://www.xml.com/pub/a/2000/03/15/feature/ , an open source XML processing library.

To generate meta tags from the output of xml2, use the following xml2html.awk script:
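# Lines of the form /path/to/node=value become meta tags.
/=/ {
    start = index($0, "=")
    # Drop the leading "/" and keep everything before the first "=".
    NAME = substr($0, 2, start - 2)
    gsub("/", "-", NAME)
    gsub("@", "-", NAME)
    gsub("--", "-", NAME)
    CONTENT = substr($0, start + 1)
    print "<meta name=\"" NAME "\" content=\"" CONTENT "\">"
    next
}

# Anything else is passed through unchanged.
{
    print
}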
For example:
Filter xml=/usr/bin/xml2 < $1 | awk -f /usr/bin/xml2html.awk > $2
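
For readers who prefer not to use awk, the same transformation can be sketched in Python. This is an illustration only, assuming that xml2 emits one "/path/to/node=value" line per node, with attributes written as "/path/@attr=value". The function names are hypothetical, and unlike the awk script this sketch also HTML-escapes the values.

```python
import html


def xml2_line_to_meta(line):
    """Convert one line of xml2 output (/path/to/node=value) into a meta tag.

    Lines without "=" are returned unchanged, like the awk fallback rule.
    """
    path, sep, value = line.partition("=")
    if not sep:
        return line
    # Drop the leading "/" and flatten the path into a dash-separated name.
    name = path[1:] if path.startswith("/") else path
    name = name.replace("/", "-").replace("@", "-").replace("--", "-")
    return '<meta name="%s" content="%s">' % (
        html.escape(name, quote=True), html.escape(value, quote=True))


def xml2_to_meta(text):
    """Convert a whole xml2 dump, line by line."""
    return "\n".join(xml2_line_to_meta(line) for line in text.splitlines())
```

Escaping the value keeps quotes and ampersands in the XML content from breaking the generated meta tag.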

MPEG Layer 3 Music (Mp3)

MPEG Layer 3 is an audio codec that compresses the original audio source significantly with very little loss in sound quality.

Mp3 files have a blob of data associated with them, called ID3. This information can contain the song title, artist, album and more. There are hundreds of programs that extract those tags. For the simplest Mp3 indexing we suggest Id3Tool by Chris "Crossfire" Collins (http://kitsumi.xware.cx ). You can download this tool in source or binary format at http://freshmeat.net/projects/id3tool/ .

In addition, Mp3 encoding information can be retrieved using Mp3Header by Owen Llyod (http://owl.yi.org ), available in source or binary format at http://owl.yi.org/programs/#mp3header .

The output of id3tool and mp3header is typically:
$ id3tool "02 Les Enfoires.mp3"

Filename: 02 Les Enfoires.mp3
Song Title:     Quand on n'a que l'amour
Artist:         Céline Dion & Mauranne
Album:          La soirée des enfoirés 96
Note:           Profits aux Restos du coeur
Year:           1996
Genre:          Chanson (0x66)
$ mp3header "02 Les Enfoires.mp3" 

02 Les Enfoires.mp3 - File Data
-------------------------------
File Size:      4188160 bytes
Est. Time:      209 secs
MPEG Version:   1
MPEG Layer:     III
BitRate:        160 kBit/s
Sample Freq:    44100 kHz
Padding:        No
Mode:           Joint Stereo
Private:        No
Copyright:      No
Orginal:        No
Emphasis:       None

This output needs to be transformed into HTML, which can be done with the following awk script (mp32html.awk):
# Collapse runs of whitespace and strip leading and trailing blanks.
# The extra parameters are awk's idiom for local variables.
function trim(input,    result, words, n, i)
{
    result = ""
    n = split(input, words, " ")
    for (i = 1; i <= n; i++)
    {
        if (length(result) > 0)
        {
            result = result " "
        }
        result = result words[i]
    }
    return result
}


{
    # index() returns 0 when the line contains no colon.
    start = index($0, ":")
    if (start == 0)
    {
        next
    }
    # awk strings are 1-indexed, so the name starts at position 1.
    NAME = trim(substr($0, 1, start - 1))
    CONTENT = trim(substr($0, start + 1))
    if (NAME == "Song Title")
    {
         print "<title>" CONTENT "</title>"
    }
    else if (length(NAME))
    {
         print "<meta name=\"" NAME "\" content=\"" CONTENT "\">"
    }
    next
}

The complete filter command line looks like this:
Filter mp3=id3tool $1 | awk -f mp32html.awk > $2 ; mp3header $1 | awk -f mp32html.awk >> $2
For the Mp3 file shown earlier, the command above produces the following output, which Alkaline can index:
<meta name="Filename" content="02 Les Enfoires.mp3">
<title>Quand on n'a que l'amour</title>
<meta name="Artist" content="Céline Dion & Mauranne">
<meta name="Album" content="La soirée des enfoirés 96">
<meta name="Note" content="Profits aux Restos du coeur">
<meta name="Year" content="1996">
<meta name="Genre" content="Chanson (0x66)">
<meta name="File Size" content="4188160 bytes">
<meta name="Est. Time" content="209 secs">
<meta name="MPEG Version" content="1">
<meta name="MPEG Layer" content="III">
<meta name="BitRate" content="160 kBit/s">
<meta name="Sample Freq" content="44100 kHz">
<meta name="Padding" content="No">
<meta name="Mode" content="Joint Stereo">
<meta name="Private" content="No">
<meta name="Copyright" content="No">
<meta name="Orginal" content="No">
<meta name="Emphasis" content="None">
Use the CustomMetas directive to expose required meta tags to the results page.
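
The same "Name: value" conversion can also be sketched in Python. This is a minimal illustration, not part of Alkaline: the function name tags_to_html is hypothetical, and the sketch assumes the id3tool and mp3header output formats shown above. Unlike the awk script, it HTML-escapes values, so an ampersand in an artist name becomes &amp;amp; in the generated tag.

```python
import html


def tags_to_html(tool_output):
    """Turn "Name: value" lines (id3tool / mp3header style) into HTML tags."""
    lines = []
    for raw in tool_output.splitlines():
        name, sep, content = raw.partition(":")
        # Collapse runs of whitespace, as the awk trim() function does.
        name = " ".join(name.split())
        content = " ".join(content.split())
        if not sep or not name:
            continue  # no colon on this line, or nothing before it
        if name == "Song Title":
            lines.append("<title>%s</title>" % html.escape(content, quote=False))
        else:
            lines.append('<meta name="%s" content="%s">'
                         % (html.escape(name, quote=True),
                            html.escape(content, quote=True)))
    return "\n".join(lines)
```

Header and separator lines without a colon, such as the "File Data" banner printed by mp3header, are skipped.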

Other Sources

A site worth visiting for filters is http://www.w3.org/Tools/Filters.html . It has an extensive list of filters available.

Keypak (http://www.keypak.com/ ) and Blueberry Filtrex (http://www.blueberry.com/ ) claim to convert a large number of document formats.