[Xapian-discuss] Indexing PDF, DOC etc.

Olly Betts olly at survex.com
Thu Nov 6 13:48:20 GMT 2008


On 06/11/2008, Florian Beer <florian.beer at dark-green.com> wrote:
> I'm trying to index PDFs that are stored in a MySQL database (blob
> field) using omindex now.
> What's the exact call to tell omindex to index a byte stream (passed
> directly from my Python program) instead of specifying a directory on
> the command line?

I don't think there is a way to do that with omindex.

> Is this even possible, or would I have to first write the PDF data out
> from the MySQL to a temporary file, let it index (supplying arbitrary
> metadata) and then delete the temp file?

omindex just feeds the PDF to an external program (pdftotext, if I
remember correctly, but check the documentation or the omindex source
code).  If that program accepts a PDF file on stdin, then it might be
worth considering a patch to allow this (possibly via scriptindex rather
than omindex).  But if it requires a file to work from, then someone
will have to generate that file anyway.

Or you could just call pdftotext directly from your Python script, then
feed that data to Xapian via the Python bindings (see the TermGenerator
class).  This seems a more natural solution to me.
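
A rough sketch of that second approach, in case it helps.  It assumes
pdftotext is on your PATH and needs a real file to read, so it writes the
blob to a temporary file first.  The function names, the "Q" prefix for the
unique ID term, the English stemmer and the title argument are illustrative
choices of mine, not anything omindex requires:

```python
import os
import subprocess
import tempfile


def extract_text(pdf_bytes, extractor=("pdftotext", "-enc", "UTF-8")):
    """Write the PDF blob to a temp file and return the extracted text.

    The command is run as: <extractor...> <pdf path> -
    where the trailing "-" asks pdftotext to write to stdout.
    """
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(pdf_bytes)
        path = tmp.name
    try:
        result = subprocess.run(
            [*extractor, path, "-"], check=True, stdout=subprocess.PIPE
        )
        return result.stdout.decode("utf-8", errors="replace")
    finally:
        os.unlink(path)  # always remove the temp file


def index_pdf(db, doc_id, pdf_bytes, title=""):
    """Index one PDF blob into an open xapian.WritableDatabase."""
    # Imported here so extract_text() works without the bindings installed.
    import xapian

    doc = xapian.Document()
    doc.set_data(title)  # whatever you want to show in results

    termgen = xapian.TermGenerator()
    termgen.set_stemmer(xapian.Stem("en"))  # assumption: English text
    termgen.set_document(doc)
    termgen.index_text(extract_text(pdf_bytes))

    # A unique boolean term lets a re-index replace the old copy.
    id_term = "Q" + str(doc_id)
    doc.add_boolean_term(id_term)
    db.replace_document(id_term, doc)
```

You would open the database once with something like
xapian.WritableDatabase("pdfindex", xapian.DB_CREATE_OR_OPEN) and call
index_pdf() for each row you fetch from MySQL, passing the primary key as
doc_id.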

Cheers,
    Olly


