[Xapian-discuss] quick-and-dirty web search for a bunch of PDFs?
Jim Lynch
jim at fayettedigital.com
Wed May 17 12:30:42 BST 2006
A general word of caution when using pdftotext to index things. If your
PDF documents have multiple columns, the locality of terms may be
incorrect. It was my experience that pdftotext paid no attention to
columns, so the last word on the first line of the first column is
followed by the first word on the first line of the second column. This
will cause problems when searching for nearby words.
For instance, if we had a document that looked like:
Column 1                            Column 2
this is a test of the near earth    discussions of this nature need to be
monitoring system.                  continued as quickly as possible.
...                                 ...
A search for the phrase "earth monitoring" would fail because, in the
extracted text, "earth" is followed by "discussions". The only way I
found to avoid this was to convert the document from PDF into PostScript
and then from PostScript into text. Apparently pdf2ps knows how to
handle multiple columns.
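To make the failure concrete, here is a small Python sketch using the two columns from the example above. It is only an illustration: the function names are made up, and it does not reproduce what pdftotext or Xapian do internally. It compares reading straight across the page with reading column by column, then checks whether the two words end up adjacent.

```python
# Two text columns from the example, one list entry per physical line.
COLUMN_1 = ["this is a test of the near earth", "monitoring system."]
COLUMN_2 = ["discussions of this nature need to be",
            "continued as quickly as possible."]

PUNCTUATION = ".,;:!?\"'"


def tokens(text):
    """Split text into lowercase words with surrounding punctuation removed."""
    stripped = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [word for word in stripped if word]


def row_wise(col1, col2):
    """Read straight across the page, ignoring the column boundary."""
    words = []
    for left, right in zip(col1, col2):
        words += tokens(left) + tokens(right)
    return words


def column_wise(col1, col2):
    """Read all of column 1 first, then all of column 2."""
    words = []
    for line in col1 + col2:
        words += tokens(line)
    return words


def has_phrase(words, phrase):
    """Return True if the words of phrase occur consecutively in words."""
    target = tokens(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


if __name__ == "__main__":
    print(has_phrase(row_wise(COLUMN_1, COLUMN_2), "earth monitoring"))
    print(has_phrase(column_wise(COLUMN_1, COLUMN_2), "earth monitoring"))
```

With the row-wise order the phrase is not found, because "earth" is immediately followed by "discussions"; with the column-wise order it is found.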
The man page for pdftotext implies that it will "'undo' physical layout
(columns, hyphenation, etc.) and output the text in reading order", but
that was not my experience, at least not for the PDF files I was indexing.
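The PDF to PostScript to text workaround could be scripted roughly as follows. This is a sketch under assumptions: pdf2ps is the tool named above, but the PostScript-to-text step is assumed here to be Ghostscript's ps2ascii, which may not be what was actually used and may not be installed on every system. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def conversion_commands(pdf_path):
    """Build the two command lines without running anything.

    Returns the intermediate PostScript path and the commands:
    pdf2ps <input.pdf> <output.ps>, then ps2ascii <input.ps>.
    ps2ascii is an assumption; substitute whatever converter you use.
    """
    pdf = Path(pdf_path)
    ps = pdf.with_suffix(".ps")
    return ps, [["pdf2ps", str(pdf), str(ps)], ["ps2ascii", str(ps)]]


def pdf_to_text_via_postscript(pdf_path):
    """Convert a PDF to text by going through PostScript first."""
    ps, (to_ps, to_text) = conversion_commands(pdf_path)
    subprocess.run(to_ps, check=True)
    try:
        # With a single argument, ps2ascii writes the text to stdout.
        result = subprocess.run(to_text, check=True,
                                capture_output=True, text=True)
    finally:
        # Remove the intermediate PostScript file (needs Python 3.8+).
        ps.unlink(missing_ok=True)
    return result.stdout
```

It is worth checking the output on a known multi-column document before indexing a whole collection this way, since the result depends on how each PDF was produced.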
Jim.
Olly Betts wrote:
>Omega's "omindex" indexer will index PDF files out of the box (just make
>sure you have pdftotext installed.)