The pdftotext Adobe PDF filter has been successfully tested. It is provided by Derek B. Noonburg <derekn@foolabs.com> as part of the xpdf tool, under the GPL license. You can get xpdf, which contains pdftotext and pdfinfo, from the FooLabs site at http://www.foolabs.com/xpdf/ .
The pdftotext program accepts two parameters: a source and a target filename. Thus the line to add to your asearch.cnf for the pdf filter looks like this:
Filter pdf=/bin/pdftotext $1 $2
Filter pdf=/bin/pdftotext $1 $2 > /dev/null 2>&1
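As a rough illustration of how the $1 and $2 placeholders map to the source and target file names, here is a minimal Python sketch. It is not Alkaline's actual implementation: the function name expand_filter and the sample paths are hypothetical, and the shell quoting of file names is an assumption added so that names with spaces survive.

```python
import re
import shlex


def expand_filter(template: str, source: str, target: str) -> str:
    """Substitute $1 (source file) and $2 (target file) in a filter template.

    Hypothetical helper for illustration only; file names are shell-quoted
    so that paths containing spaces stay a single argument.
    """
    values = {"1": shlex.quote(source), "2": shlex.quote(target)}
    # A single regex pass avoids re-substituting text inside a file name.
    return re.sub(r"\$([12])", lambda m: values[m.group(1)], template)


if __name__ == "__main__":
    print(expand_filter("/bin/pdftotext $1 $2", "/tmp/in file.pdf", "/tmp/out.txt"))
```

The expanded string is what would be handed to the shell; the second variant of the directive simply discards whatever the program prints.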
You can use pdftotext along with pdfinfo in order to generate HTML content and benefit from meta keywords, author and document title. Derek's pdftohtml shell script for UNIX implements those features and is available at http://alkaline.vestris.com/filters/pdftohtml .
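The idea of combining the two tools can be sketched in Python as follows. This is not the pdftohtml script mentioned above, only an illustration under stated assumptions: it assumes pdfinfo prints one "Key: value" pair per line (with fields such as Title, Author and Keywords), and the function names parse_pdfinfo and build_html are hypothetical.

```python
import html


def parse_pdfinfo(info_text: str) -> dict:
    """Parse 'Key: value' lines (as assumed for pdfinfo output) into a dict."""
    fields = {}
    for line in info_text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def build_html(info_text: str, body_text: str) -> str:
    """Wrap extracted text in HTML with a title and meta tags from the info."""
    fields = parse_pdfinfo(info_text)
    head = []
    if fields.get("Title"):
        head.append("<title>%s</title>" % html.escape(fields["Title"], quote=False))
    for name in ("Author", "Keywords"):
        if fields.get(name):
            head.append('<meta name="%s" content="%s">'
                        % (name.lower(), html.escape(fields[name])))
    return ("<html><head>\n%s\n</head><body><pre>\n%s\n</pre></body></html>"
            % ("\n".join(head), html.escape(body_text, quote=False)))
```

In a real filter, info_text and body_text would come from running pdfinfo and pdftotext on the source file; here they are plain strings so the sketch stays self-contained.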
You can pick up pdftotext, gzip and pdf2html for Windows NT in the Alkaline distribution directory at http://alkaline.vestris.com/download/WinNT/pdf2text.zip . The asearch.cnf configuration directives must use fully qualified Windows paths:
Filter pdf=c:\tools\pdftotext.exe $1 $2
Filter pdf=c:\tools\pdf2html.bat $1 $2
The wvWare Microsoft Word filter has been successfully tested. Formerly known as MsWordView, it is provided by Caolán McNamara <caolan.mcnamara@ul.ie> and can be found at http://www.wvware.com/ under the GPL license. wvWare can load and parse the Word 2000, 97, 95 and 6 file formats.
The filter syntax to use is simply:
Filter doc=/bin/wvHtml $1 > $2
Rtf2Html is commercial software from Chris Hector <chris@sunpack.com> and is available at http://www.sunpack.com/RTF/ . This filter has not been tested.
TeX is a typesetting system written by Donald E. Knuth. LaTeX is a TeX macro package, originally written by Leslie Lamport, that provides a document processing system. A great TeX FAQ can be found at http://www.tex.ac.uk/cgi-bin/texfaq2html .
A LaTeX2HTML converter is available from Nikos Drakos <nikos@mpn.com> of the University of Leeds at http://cbl.leeds.ac.uk/nikos/tex2html/doc/latex2html/latex2html.html . This filter has not been tested.
WebConvert, a commercial program for Windows available from http://www.webconvert.com/ , promises to convert most of the major document formats. It can be used from the command line, so it can run as an Alkaline filter. It has not been tested.
You can grab the source code in C of a tested Shockwave Flash decoder at http://alkaline.vestris.com/filters/swfparse.cpp . A volunteer is welcome to transform this into a real filter that would extract text and links.
XML documents are structured differently and need some processing before they can be indexed usefully. This can be done with a freeware tool, Xml2, by Dan Egnor, at http://ofb.net/~egnor/xml2/ . Also check out Pixie, an open source XML processing library by Sean McGrath, at http://www.xml.com/pub/a/2000/03/15/feature/ .
To generate meta tags from the output of xml2, use the following xml2html.awk script:
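# Lines of the form /path/to/node=value become meta tags.
/=/ {
    start = index($0, "=")
    # Skip the leading "/" and keep everything before the first "=".
    NAME = substr($0, 2, start - 2)
    gsub("/", "-", NAME)
    gsub("@", "-", NAME)
    gsub("--", "-", NAME)
    CONTENT = substr($0, start + 1)
    print "<meta name=\"" NAME "\" content=\"" CONTENT "\">"
    next
}
# Any other line is passed through unchanged.
{
    print
}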
Filter xml=/usr/bin/xml2 < $1 | awk -f /usr/bin/xml2html.awk > $2
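To make the intended transformation easier to follow, here is a minimal Python sketch of what the xml2html.awk script is meant to do to each line. It assumes xml2 emits one "/path/to/node=value" line per node, with attributes written as "@name"; the function name xml2_line_to_html and the sample lines are hypothetical. Like the awk script, it does not escape special characters in the value.

```python
def xml2_line_to_html(line: str) -> str:
    """Turn one '/path/to/node=value' line into a meta tag.

    Lines without '=' are returned unchanged.
    """
    path, sep, content = line.partition("=")
    if not sep:
        return line
    name = path[1:]  # drop the leading "/"
    # Same substitutions as the awk script: "/" and "@" become "-",
    # then doubled dashes are collapsed.
    name = name.replace("/", "-").replace("@", "-").replace("--", "-")
    return '<meta name="%s" content="%s">' % (name, content)


if __name__ == "__main__":
    for sample in ("/doc/title=Hello", "/doc/@id=42", "/doc/section"):
        print(xml2_line_to_html(sample))
```

Flattening the element path into the meta name keeps the document structure visible in the name itself, for example doc-title for the title element inside doc.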
MPEG layer 3 (MP3) is an audio codec that compresses the original audio source significantly with very little loss in sound quality.
MP3 files have a blob of data associated with them, called ID3. This information can contain the song title, artist, album and more. Many programs can extract these tags. For the simplest MP3 indexing we suggest Id3Tool by Chris "Crossfire" Collins (http://kitsumi.xware.cx ). You can download this tool as source code or in binary format at http://freshmeat.net/projects/id3tool/
In addition, MP3 encoding information can be retrieved using Mp3Header by Owen Llyod (http://owl.yi.org ), available in source or binary format at http://owl.yi.org/programs/#mp3header .
The output of id3tool and mp3header is typically:
$ id3tool "02 Les Enfoires.mp3"
Filename: 02 Les Enfoires.mp3
Song Title: Quand on n'a que l'amour
Artist: Céline Dion & Mauranne
Album: La soirée des enfoirés 96
Note: Profits aux Restos du coeur
Year: 1996
Genre: Chanson (0x66)

$ mp3header "02 Les Enfoires.mp3"
02 Les Enfoires.mp3 - File Data
-------------------------------
File Size: 4188160 bytes
Est. Time: 209 secs
MPEG Version: 1
MPEG Layer: III
BitRate: 160 kBit/s
Sample Freq: 44100 kHz
Padding: No
Mode: Joint Stereo
Private: No
Copyright: No
Orginal: No
Emphasis: None
This output needs to be transformed into HTML, which can be done with the following awk script (mp32html.awk):
# Strip leading and trailing whitespace and collapse inner runs of blanks.
# The extra parameters are local variables.
function trim(input,    result, words, n, i)
{
    result = ""
    n = split(input, words, " ")
    for (i = 1; i <= n; i++)
    {
        if (length(result) > 0)
        {
            result = result " "
        }
        result = result words[i]
    }
    return result
}

{
    # index() returns 0 when the line contains no colon.
    start = index($0, ":")
    if (start == 0)
    {
        next
    }
    # awk strings are 1-based.
    NAME = trim(substr($0, 1, start - 1))
    CONTENT = trim(substr($0, start + 1))
    if (NAME == "Song Title")
    {
        print "<title>" CONTENT "</title>"
    }
    else if (length(NAME))
    {
        print "<meta name=\"" NAME "\" content=\"" CONTENT "\">"
    }
    next
}
The complete filter command line looks like this:
Filter mp3=id3tool $1 | awk -f mp32html.awk > $2 ; mp3header $1 | awk -f mp32html.awk >> $2
For the sample file above, the filter produces:
<meta name="Filename" content="02 Les Enfoires.mp3">
<title>Quand on n'a que l'amour</title>
<meta name="Artist" content="Céline Dion & Mauranne">
<meta name="Album" content="La soirée des enfoirés 96">
<meta name="Note" content="Profits aux Restos du coeur">
<meta name="Year" content="1996">
<meta name="Genre" content="Chanson (0x66)">
<meta name="File Size" content="4188160 bytes">
<meta name="Est. Time" content="209 secs">
<meta name="MPEG Version" content="1">
<meta name="MPEG Layer" content="III">
<meta name="BitRate" content="160 kBit/s">
<meta name="Sample Freq" content="44100 kHz">
<meta name="Padding" content="No">
<meta name="Mode" content="Joint Stereo">
<meta name="Private" content="No">
<meta name="Copyright" content="No">
<meta name="Orginal" content="No">
<meta name="Emphasis" content="None">
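Note that the generated tags contain the raw "&" from the artist name. If that matters for your indexing, the same conversion can be done with escaping. The following Python sketch is one possible variant, not part of the Alkaline distribution; the function name tags_to_html is hypothetical, and it assumes the "Name: value" output format shown for id3tool and mp3header.

```python
import html


def tags_to_html(tool_output: str) -> list:
    """Convert 'Name: value' lines into <title> and <meta> tags.

    Lines without a colon (banners, separators) are skipped, and
    values are HTML-escaped.
    """
    tags = []
    for raw in tool_output.splitlines():
        name, sep, content = raw.partition(":")
        # Trim and collapse whitespace, like trim() in the awk script.
        name = " ".join(name.split())
        content = " ".join(content.split())
        if not sep or not name:
            continue
        if name == "Song Title":
            tags.append("<title>%s</title>" % html.escape(content, quote=False))
        else:
            tags.append('<meta name="%s" content="%s">'
                        % (html.escape(name), html.escape(content)))
    return tags


if __name__ == "__main__":
    sample = "Song Title: Quand on n'a que l'amour\nArtist: Céline Dion & Mauranne\n"
    print("\n".join(tags_to_html(sample)))
```

Escaping keeps the attribute values well-formed even when a tag contains quotes or ampersands.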
A site worth visiting for filters is http://www.w3.org/Tools/Filters.html . It has an extensive list of filters available.
Keypack (http://www.keypak.com/ ) and Blueberry Filtrex (http://www.blueberry.com/ ) claim to convert a large number of document formats.