[Xapian-discuss] Indexing PDF, DOC etc.
florian.beer at dark-green.com
Thu Nov 6 11:31:03 GMT 2008
I'm trying to index PDFs that are stored in a MySQL database (blob
field) using omindex now.
What's the exact call to tell omindex to index a byte stream (passed
directly from my Python program) instead of specifying a directory on
Is this even possible, or would I first have to write the PDF data out
from MySQL to a temporary file, let omindex index it (supplying
arbitrary metadata), and then delete the temp file?
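For reference, the temp-file route asked about above can also be done without omindex, staying inside Python. The following is only a rough sketch under stated assumptions: the PDF bytes have already been fetched from the blob field, the external pdftotext tool is installed and on the PATH, and the Xapian Python bindings are available. The function names blob_to_text and index_text, the "{path}" placeholder, and the "english" stemmer choice are made up for illustration.

```python
import os
import subprocess
import tempfile


def blob_to_text(blob, command=("pdftotext", "{path}", "-")):
    """Write blob to a temporary file, run a converter on it, return its stdout."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(blob)
        # Substitute the temp file's path for the hypothetical "{path}" placeholder.
        argv = [arg.replace("{path}", path) for arg in command]
        result = subprocess.run(argv, stdout=subprocess.PIPE, check=True)
        return result.stdout.decode("utf-8", errors="replace")
    finally:
        # The temp file is deleted whether or not the conversion succeeded.
        os.remove(path)


def index_text(db_path, text, data):
    """Add one document built from already-extracted text to a Xapian database."""
    import xapian  # imported here so blob_to_text works without the bindings

    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    doc = xapian.Document()
    doc.set_data(data)  # arbitrary metadata to show alongside search results
    generator = xapian.TermGenerator()
    generator.set_stemmer(xapian.Stem("english"))  # assumption: English text
    generator.set_document(doc)
    generator.index_text(text)
    docid = db.add_document(doc)
    db.commit()
    return docid
```

A call would then look roughly like index_text(db_path, blob_to_text(pdf_bytes), metadata), where pdf_bytes is the value read from the blob column. The converter command is a parameter so that other formats (DOC and so on) could be handled by swapping in a different text-extraction tool.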
On Nov 5, 2008, at 11:52 , Charlie Hull wrote:
> Florian Beer wrote:
>> Hello dear list,
>> I'm trying to index various types of files with Xapian, used in a
>> Python program.
>> Text and HTML work fine via index_text() but I can't find any
>> explanations for indexing other types of files.
>> Is it the case that _everything_ has to be converted to text prior to
>> indexing it?
>> I didn't find a definitive answer to that anywhere on the WWW, some
>> mailing lists and the Xapian documentation.
>> (I only found references to e.g. pdf2text and the like)
> Yes. However you can do this using the provided application Omega, in
> particular the program Omindex. You can find this on the Xapian
>> I was thinking, from reading Xapian's features page, that it can
>> natively index a vast amount of different file types. If I do need to
>> convert everything to text first, that would mean Xapian can - in
>> reality - only work with plain text, which would make it rather
>> useless for my purpose.
>> Thanks in advance for sharing any insights,
>> Xapian-discuss mailing list
>> Xapian-discuss at lists.xapian.org