Yes, Alkaline can index PDF documents and many other document formats. If your results come out as garbage, you haven't set up the PDF filter properly: either it is not being invoked on the PDF document or it is not working properly.
On UNIX, download and install the xpdf package from http://www.foolabs.com/xpdf/ . On Windows, download and extract the binary version of xpdf from http://alkaline.vestris.com/download/WinNT/pdf2text.zip .
Check that the PDF converter works at all. Download a PDF document (try the Alkaline FAQ from http://alkaline.vestris.com/docs/pdf/alkaline-faq.pdf ) and run the PDF filter from the command line.
    server:~/pdf$ pdftotext alkaline-faq.pdf alkaline-faq.txt
    server:~/pdf$ ls -la
    -rw-r--r-- 1 user users 231193 Oct 13 12:11 alkaline-faq.pdf
    -rw-r--r-- 1 user users 100896 Oct 20 03:48 alkaline-faq.txt

    D:\pdf>d:\pdf2html\pdftotext.exe alkaline-faq.pdf alkaline-faq.txt
    D:\Private\server-prod\pdf>dir
    10/20/2001 11:02a 231,193 alkaline-faq.pdf
    10/20/2001 11:02a 103,195 alkaline-faq.txt
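If you would rather script this check than eyeball the file sizes, the following is a minimal Python sketch. It is not part of Alkaline or xpdf: the function names `looks_like_text` and `convert_and_check` and the 0.95 threshold are made up for illustration. It only assumes that the converter is called as `converter input.pdf output.txt`, as in the sessions above, and that a raw PDF file begins with `%PDF`.

```python
import string
import subprocess
from pathlib import Path

# Bytes accepted as ordinary text: printable ASCII, including tab, newline
# and the form feed that page-oriented converters may emit.
PRINTABLE = set(string.printable.encode("ascii"))


def looks_like_text(data: bytes, threshold: float = 0.95) -> bool:
    """Heuristic (hypothetical helper): True if data is plausibly converted text."""
    if not data:
        return False  # the converter produced nothing
    if data.startswith(b"%PDF"):
        return False  # still a raw PDF, so no conversion took place
    # Bytes >= 0x80 are accepted so that non-English text is not rejected.
    good = sum(1 for b in data if b in PRINTABLE or b >= 0x80)
    return good / len(data) >= threshold


def convert_and_check(converter: str, pdf_path: str, txt_path: str) -> bool:
    """Run `converter pdf_path txt_path` and sanity-check the output file."""
    result = subprocess.run([converter, pdf_path, txt_path])
    out = Path(txt_path)
    return (
        result.returncode == 0
        and out.is_file()
        and looks_like_text(out.read_bytes())
    )
```

For example, `convert_and_check("pdftotext", "alkaline-faq.pdf", "alkaline-faq.txt")` should return True once the converter is installed and on your path. If the converter cannot be found at all, `subprocess.run` raises `FileNotFoundError`, which is itself a useful hint that a full path is needed.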
Make sure that you have both an ExtsAdd=pdf and a Filter pdf=... directive in your asearch.cnf and that the filter executes a valid command line, similar to your test above. It is often necessary to specify full paths. Note that the UNIX path separator is a slash and the Windows path separator is a backslash.
    # example on UNIX (/alkaline/pdf/asearch.cnf)
    UrlList=http://alkaline.vestris.com/docs/pdf/alkaline.pdf
    ExtsAdd=pdf
    Filter pdf=/usr/local/bin/pdftotext $1 $2
    Robots=N

    # example on Windows (d:\alkaline\pdf\asearch.cnf)
    UrlList=http://alkaline.vestris.com/docs/pdf/alkaline-faq.pdf
    ExtsAdd=pdf
    Filter pdf=d:\pdf2html\pdftotext.exe $1 $2
    Robots=N
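The checks described above can be sketched as a small Python script that reads the text of an asearch.cnf and reports the usual mistakes. This is an illustration only and makes several assumptions that go beyond what the examples show: that the file consists of simple `Name=value` lines with `#` comments, that ExtsAdd may list several extensions separated by commas or spaces, that the filter command always carries `$1` and `$2`, and that the program path contains no spaces. The names `parse_cnf` and `check_filter` are hypothetical.

```python
import ntpath
import os
import posixpath


def parse_cnf(text):
    """Return (name, value) pairs from 'Name=value' lines, skipping blanks and # comments."""
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        pairs.append((name.strip(), value.strip()))
    return pairs


def check_filter(text, ext="pdf", exists=os.path.isfile):
    """Return a list of problems found for one extension; empty if none were found."""
    problems = []
    pairs = parse_cnf(text)

    # Assumption: ExtsAdd may hold several extensions, comma or space separated.
    added = [e for n, v in pairs if n == "ExtsAdd" for e in v.replace(",", " ").split()]
    if ext not in added:
        problems.append(f"missing ExtsAdd={ext}")

    commands = [v for n, v in pairs if n == f"Filter {ext}"]
    if not commands:
        problems.append(f"missing Filter {ext}=...")

    for command in commands:
        parts = command.split()
        if not parts:
            problems.append(f"Filter {ext}= has no command")
            continue
        program = parts[0]  # assumption: no spaces inside the program path
        # Accept a full path in either UNIX or Windows notation.
        if not (posixpath.isabs(program) or ntpath.isabs(program)):
            problems.append(f"{program}: not a full path")
        elif not exists(program):
            problems.append(f"{program}: file not found")
        # Assumption: both placeholders used in the examples are required.
        for placeholder in ("$1", "$2"):
            if placeholder not in parts:
                problems.append(f"{command}: no {placeholder} argument")
    return problems
```

Called as `check_filter(open("asearch.cnf").read())`, it returns an empty list when nothing suspicious was found. The `exists` parameter is there only so that the path test can be replaced, for example when checking a Windows configuration from a UNIX machine.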
Try indexing with the configuration above, for example:
    server:~/alkaline$ asearch pdf reindex
    ...
    [http://alkaline.vestris.com/docs/pdf/alkaline-faq.pdf] (-1/1) - [231193 bytes][0] [ext filter: pdf][inf][lnx][md5][vix][keys][ndx][mta][ok]
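If you index many PDF documents, you may want to scan the indexing output rather than read it. The Python sketch below picks the bracketed tags out of one output line. Treat it as a guess: the meaning of the tags is inferred from the sample line above, namely that `[ext filter: pdf]` shows the filter was applied and that a trailing `[ok]` shows the document was indexed. The function name `summarize_index_line` is hypothetical.

```python
import re

# Matches the text inside each [...] group of an indexing output line.
TAG_RE = re.compile(r"\[([^\[\]]*)\]")


def summarize_index_line(line, ext="pdf"):
    """Summarize one output line; tag meanings are assumed, not documented."""
    tags = TAG_RE.findall(line)
    return {
        "url": tags[0] if tags else None,  # first bracket holds the URL
        "filter_invoked": f"ext filter: {ext}" in tags,
        "ok": bool(tags) and tags[-1] == "ok",
    }
```

A line for a PDF document where `filter_invoked` is False would suggest that the ExtsAdd or Filter directive is not being picked up.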
To get proper document titles and descriptions, use pdf2html. The Windows distribution of pdf2text comes with pdf2html.bat, which should work if you simply replace the call to pdftotext.exe with pdf2html.bat. On UNIX, use the pdftohtml csh script from http://alkaline.vestris.com/filters/pdftohtml in the same way. Always check that these scripts work directly from the command line before using them in an asearch.cnf.
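One way to check the output of these scripts from the command line is to look for a title in the file they produce. The Python sketch below assumes that the script writes an HTML file containing a `<title>` element and possibly a `<meta name="description">` tag. The actual output format of pdf2html.bat and pdftohtml is not described here, so adjust the check to what you see. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TitleAndDescription(HTMLParser):
    """Collects the <title> text and the description meta tag, if present."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.description = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "description":
                self.description = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title_and_description(html_text):
    """Return (title, description); either is an empty string when missing."""
    parser = TitleAndDescription()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.description.strip()
```

An empty title for the converted FAQ would mean that the script ran but did not produce what the indexer needs for a proper document title.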