[Xapian-discuss] quick-and-dirty web search for a bunch of PDFs?

Tim Brody tdb01r at ecs.soton.ac.uk
Wed May 17 13:10:04 BST 2006


Jim Lynch wrote:
> A general word of caution when using pdftotext to index things.  If your
> PDF documents have multiple columns, the locality of terms may be
> incorrect.  In my experience pdftotext paid no attention to
> columns, so the last word of the first line in the first column is followed
> by the first word of the first line in the second column.  This will cause
> problems when searching for nearby words.
> 
> For instance, if we had a document that looked like
>  Column 1                            Column 2
>  this is a test of the near earth    discussions of this nature need to be
>  monitoring system.                  continued as quickly as possible.
>  ...                                 ...
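
To make the effect concrete, here is a small sketch using the two columns
from the example above. The function names (row_order, column_order, gap)
are made up for illustration and are not part of pdftotext or Xapian; the
sketch only compares how far apart "earth" and "monitoring" end up under
each reading order.

```python
# The two columns from the example, one string per line of the page.
col1 = ["this is a test of the near earth", "monitoring system."]
col2 = ["discussions of this nature need to be",
        "continued as quickly as possible."]


def row_order(left, right):
    """Read straight across the page, ignoring the column boundary."""
    words = []
    for left_line, right_line in zip(left, right):
        words += left_line.split() + right_line.split()
    return words


def column_order(left, right):
    """Read all of the first column, then all of the second."""
    return " ".join(left + right).split()


def gap(words, a, b):
    """Distance in word positions between the first occurrences of a and b."""
    return abs(words.index(a) - words.index(b))


print(gap(row_order(col1, col2), "earth", "monitoring"))
print(gap(column_order(col1, col2), "earth", "monitoring"))
```

Read across the page, the two words that are adjacent in the real text end
up several positions apart, which is what throws off a proximity search.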

Version 3 is supposed to handle columns, but I agree it's flaky.

Try using the -raw option, which outputs the text in 'stream order'.
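
If you are driving the extraction from a script, something along these
lines might do. It is only a sketch: it assumes pdftotext is on your PATH,
the helper names (pdftotext_command, extract_text) are made up, and the
UTF-8 decoding is a guess about your output encoding. The "-" in place of
the output file name asks pdftotext to write to stdout.

```python
import subprocess


def pdftotext_command(pdf_path, raw=True):
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    cmd = ["pdftotext"]
    if raw:
        # -raw keeps the text in content stream order.
        cmd.append("-raw")
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, raw=True):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, raw),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Assumption: output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("paper.pdf") would then give you the text to pass to
your indexer; whether stream order matches reading order still depends on
how the PDF was produced.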

Another possibility is 'PDFBox' (a Java API for PDF), which comes with a
text extraction tool:
http://www.pdfbox.org/

Tim.


