The pdf2text Adobe PDF filter has been successfully tested.
It is based on the xpdf tools by Derek B. Noonburg <derekn@foolabs.com>, released under the GPL license.
You should get xpdf, which contains pdftotext and pdfinfo, from the FooLabs site (http://www.foolabs.com/xpdf/).
The pdftotext program accepts two parameters: a source and a target filename.
Thus the line to add to your asearch.cnf for the pdf filter looks like this:
Filter pdf=/bin/pdftotext $1 $2
You can avoid seeing filter errors by appending > nul on Windows NT, or > /dev/null 2>&1 on Unix, to the command line:
Filter pdf=/bin/pdftotext $1 $2 > /dev/null 2>&1
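To make the $1 and $2 convention concrete, here is a minimal Python sketch of how such a filter line could be expanded into a shell command. It assumes only what is described above, namely that $1 stands for the source file and $2 for the target file. The function name expand_filter is hypothetical, and this is not Alkaline's actual implementation.

```python
import re
import shlex


def expand_filter(command: str, source: str, target: str) -> str:
    """Substitute $1 (source) and $2 (target) into a filter command line.

    Hypothetical helper for illustration; not Alkaline's actual code.
    """
    # Quote both paths so that file names containing spaces survive the shell.
    args = {"1": shlex.quote(source), "2": shlex.quote(target)}
    # Replace both placeholders in a single pass so a path that happens to
    # contain "$2" is never substituted a second time.
    return re.sub(r"\$([12])", lambda m: args[m.group(1)], command)


if __name__ == "__main__":
    line = "/bin/pdftotext $1 $2 > /dev/null 2>&1"
    print(expand_filter(line, "/docs/user guide.pdf", "/tmp/out.txt"))
```

Quoting the paths is a design choice of this sketch: it keeps a document name such as "user guide.pdf" from being split into two arguments.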
You can use pdftotext along with pdfinfo to generate HTML content and benefit from meta keywords,
author and document title. Derek's pdftohtml shell script for UNIX
(http://alkaline.vestris.com/filters/pdftohtml) implements those features.
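The idea of combining the two tools can be sketched in Python as follows. This is not the pdftohtml script mentioned above, only an illustration: it assumes that the pdfinfo output consists of "Name: value" lines that include Title, Author and Keywords, and that the pdftotext output is already available as a string. The function name pdf_to_html is hypothetical.

```python
import html


def pdf_to_html(info_text: str, body_text: str) -> str:
    """Build a small HTML page from pdfinfo-style fields and extracted text."""
    # Parse "Name: value" lines into a dictionary.
    fields = {}
    for line in info_text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()

    head = []
    if fields.get("Title"):
        head.append("<title>%s</title>" % html.escape(fields["Title"]))
    # Expose author and keywords as meta tags when they are present.
    for name in ("Author", "Keywords"):
        if fields.get(name):
            head.append('<meta name="%s" content="%s">'
                        % (name.lower(), html.escape(fields[name])))

    return ("<html><head>\n" + "\n".join(head) + "\n</head><body><pre>\n"
            + html.escape(body_text) + "\n</pre></body></html>\n")


if __name__ == "__main__":
    info = "Title:          Annual Report\nAuthor:         J. Doe\n"
    print(pdf_to_html(info, "Hello & goodbye"))
```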
You can pick up executables of pdf2text, gzip and pdf2html for Windows NT in the Alkaline distribution directory
(http://alkaline.vestris.com/download/WinNT/pdf2text.zip).
The asearch.cnf configuration directives must use fully qualified Windows paths:
Filter pdf=c:\tools\pdftotext.exe $1 $2
or, if you are using pdf2html:
Filter pdf=c:\tools\pdf2html.bat $1 $2
The wvWare Microsoft Word filter has been successfully tested.
Formerly known as MsWordView, it is provided by Caolán McNamara <caolan.mcnamara@ul.ie>.
It can be found at http://www.wvware.com/ under the GPL license.
wvWare can load and parse the Word 2000, 97, 95 and 6 file formats.
The filter syntax is simply:
Filter doc=/bin/wvHtml $1 > $2
Rtf2Html (http://www.sunpack.com/RTF/) is commercial software from Chris Hector <chris@sunpack.com>.
This filter has not been tested.
WebConvert (http://www.webconvert.com/), a commercial program for Windows,
promises to convert most of the major document formats.
It can be used from the command line, so it can run as an Alkaline filter.
It has not been tested.
XML documents are structured differently and need some processing in order to be indexed and useful.
This can be done with the freeware tool Xml2 by Dan Egnor (http://ofb.net/~egnor/xml2/).
Also make sure to check Pyxie, an open source XML processing library by Sean McGrath
(http://www.xml.com/pub/a/2000/03/15/feature/).
To generate meta tags from the output of xml2, use the following xml2html.awk script:
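# Turn each "path=value" line produced by xml2 into an HTML meta tag.
/=/ {
    start = index($0, "=")
    # Drop the leading "/" and flatten the path into a tag name.
    NAME = substr($0, 2, start - 2)
    gsub("/", "-", NAME)
    gsub("@", "-", NAME)
    gsub("--", "-", NAME)
    CONTENT = substr($0, start + 1)
    print "<meta name=\"" NAME "\" content=\"" CONTENT "\">"
    next
}
# Pass through lines that carry no value.
{
    print
}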
For example:
Filter xml=/usr/bin/xml2 < $1 | awk -f /usr/bin/xml2html.awk > $2
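For readers who prefer not to use awk, the same name-flattening logic can be sketched in Python. The sketch assumes that xml2 emits one "/path/to/element=value" line per value, with attributes written as "@name", which is what the awk script above expects. The function name xml2_line_to_html is hypothetical, and escaping the content with html.escape is an addition that the awk script does not perform.

```python
import html


def xml2_line_to_html(line: str) -> str:
    """Convert one xml2 output line into a meta tag, or return it unchanged."""
    path, sep, content = line.partition("=")
    if not sep:
        # No value on this line: pass it through as the awk script does.
        return line
    # Drop the leading "/" and flatten the path into a tag name.
    name = path[1:].replace("/", "-").replace("@", "-").replace("--", "-")
    return '<meta name="%s" content="%s">' % (name, html.escape(content))


if __name__ == "__main__":
    for sample in ("/doc/title=Hello", "/doc/@id=42", "/doc/para"):
        print(xml2_line_to_html(sample))
```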
MPEG Layer 3 is an audio codec that compresses the original audio source significantly
with very little loss in sound quality.
Mp3 files carry a block of data called ID3. This information can contain the song title,
artist, album and more. There are hundreds of programs that extract those tags. For the simplest Mp3
indexing we suggest Id3Tool by Chris "Crossfire" Collins (http://kitsumi.xware.cx).
The tool is available as source code or in binary format (http://freshmeat.net/projects/id3tool/).
In addition, Mp3 encoding information can be retrieved using Mp3Header by Owen Llyod
(http://owl.yi.org), available in source or binary format (http://owl.yi.org/programs/#mp3header).
The output of id3tool and mp3header is typically:
$ id3tool "02 Les Enfoires.mp3"
Filename: 02 Les Enfoires.mp3
Song Title: Quand on n'a que l'amour
Artist: Céline Dion & Mauranne
Album: La soirée des enfoirés 96
Note: Profits aux Restos du coeur
Year: 1996
Genre: Chanson (0x66)
$ mp3header "02 Les Enfoires.mp3"
02 Les Enfoires.mp3 - File Data
-------------------------------
File Size: 4188160 bytes
Est. Time: 209 secs
MPEG Version: 1
MPEG Layer: III
BitRate: 160 kBit/s
Sample Freq: 44100 kHz
Padding: No
Mode: Joint Stereo
Private: No
Copyright: No
Orginal: No
Emphasis: None
This output needs to be transformed into HTML, which can be done using the following awk script (mp32html.awk):
# Collapse runs of whitespace and strip leading and trailing blanks.
function trim(input,    result, n, words, i)
{
    result = ""
    n = split(input, words, " ")
    for (i = 1; i <= n; i++)
    {
        if (length(result) > 0)
        {
            result = result " "
        }
        result = result words[i]
    }
    return result
}
{
    # index() returns 0 when the line contains no colon.
    start = index($0, ":")
    if (start == 0)
    {
        next
    }
    # awk strings are 1-based, so the name starts at position 1.
    NAME = trim(substr($0, 1, start - 1))
    CONTENT = trim(substr($0, start + 1))
    if (NAME == "Song Title")
    {
        print "<title>" CONTENT "</title>"
    }
    else if (length(NAME))
    {
        print "<meta name=\"" NAME "\" content=\"" CONTENT "\">"
    }
    next
}
The complete filter command line looks like this:
Filter mp3=id3tool $1 | awk -f mp32html.awk > $2 ; mp3header $1 | awk -f mp32html.awk >> $2
The output of the command line above for this Mp3 file can be indexed by Alkaline:
<meta name="Filename" content="02 Les Enfoires.mp3">
<title>Quand on n'a que l'amour</title>
<meta name="Artist" content="Céline Dion & Mauranne">
<meta name="Album" content="La soirée des enfoirés 96">
<meta name="Note" content="Profits aux Restos du coeur">
<meta name="Year" content="1996">
<meta name="Genre" content="Chanson (0x66)">
<meta name="File Size" content="4188160 bytes">
<meta name="Est. Time" content="209 secs">
<meta name="MPEG Version" content="1">
<meta name="MPEG Layer" content="III">
<meta name="BitRate" content="160 kBit/s">
<meta name="Sample Freq" content="44100 kHz">
<meta name="Padding" content="No">
<meta name="Mode" content="Joint Stereo">
<meta name="Private" content="No">
<meta name="Copyright" content="No">
<meta name="Orginal" content="No">
<meta name="Emphasis" content="None">
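The conversion of "Name: value" lines into title and meta tags can also be sketched in Python. This is an illustration rather than a replacement for mp32html.awk: the function names are hypothetical, and unlike the awk script it escapes HTML special characters, so the "&" in the artist name would come out as "&amp;".

```python
import html
from typing import Optional


def tag_line_to_html(line: str) -> Optional[str]:
    """Convert one "Name: value" line into a title or meta tag."""
    name, sep, content = line.partition(":")
    if not sep:
        # Lines without a colon (banners, separators) carry no tag.
        return None
    # Collapse runs of whitespace, like the trim() function in the awk script.
    name = " ".join(name.split())
    content = " ".join(content.split())
    if not name:
        return None
    if name == "Song Title":
        return "<title>%s</title>" % html.escape(content, quote=False)
    return '<meta name="%s" content="%s">' % (html.escape(name),
                                              html.escape(content))


def tool_output_to_html(text: str) -> str:
    """Convert the whole output of id3tool or mp3header."""
    tags = (tag_line_to_html(line) for line in text.splitlines())
    return "\n".join(tag for tag in tags if tag)


if __name__ == "__main__":
    sample = "Song Title:\tQuand on n'a que l'amour\nYear:\t\t1996\n"
    print(tool_output_to_html(sample))
```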
Use the CustomMetas directive to expose required meta tags to the results page.