[Xapian-discuss] Indexing PDF, DOC etc.

Thu Nov 6 11:31:03 GMT 2008

I'm now trying to index PDFs that are stored in a MySQL database (blob
field) using omindex.
What's the exact call to tell omindex to index a byte stream (passed
directly from my Python program) instead of specifying a directory on
the command line?

Is this even possible, or would I first have to write the PDF data out
of MySQL to a temporary file, let omindex index it (supplying arbitrary
metadata), and then delete the temp file?
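
For reference, the temporary-file route could look roughly like the
sketch below. It does not call omindex at all. Instead it converts the
blob to text with an external tool and indexes the result through the
Python bindings, using the same index_text() call that already works
for text and HTML. Assumptions: the pdftotext command-line tool is
installed, the xapian Python module is importable, and the names
extract_text and index_blob are made up for illustration. The "Q"
prefix for the unique-ID term follows Omega's convention.

```python
import os
import subprocess
import tempfile


def extract_text(blob, converter=("pdftotext", "{path}", "-"), suffix=".pdf"):
    """Write blob to a temp file, run a converter on it, return its stdout.

    The converter is a command template; "{path}" is replaced by the temp
    file name. The default assumes pdftotext is on PATH ("-" means stdout).
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(blob)
        cmd = [arg.format(path=path) for arg in converter]
        result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
        return result.stdout.decode("utf-8", errors="replace")
    finally:
        # Always delete the temp file, even if the converter failed.
        os.remove(path)


def index_blob(db_path, doc_id, blob, metadata=""):
    """Hypothetical helper: index one PDF blob under a unique ID term."""
    import xapian  # imported here so extract_text works without Xapian

    text = extract_text(blob)
    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)

    doc = xapian.Document()
    doc.set_data(metadata)  # arbitrary metadata to show with search results

    indexer = xapian.TermGenerator()
    indexer.set_stemmer(xapian.Stem("english"))
    indexer.set_document(doc)
    indexer.index_text(text)

    # A unique boolean term lets re-indexing replace the old document.
    id_term = "Q" + str(doc_id)
    doc.add_boolean_term(id_term)
    db.replace_document(id_term, doc)
    db.commit()
```

With rows fetched from MySQL as (id, blob) pairs, each one would be
passed to index_blob(db_path, row_id, blob_bytes). Other file types
would need a different converter command in place of pdftotext.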

On Nov 5, 2008, at 11:52 , Charlie Hull wrote:

> Florian Beer wrote:
>> Hello dear list,
>>
>> I'm trying to index various types of files with Xapian, used in a
>> Python program.
>> Text and HTML work fine via index_text() but I can't find any
>> explanations for indexing other types of files.
>>
>> Is it the case that _everything_ has to be converted to text prior to
>> indexing it?
>> I didn't find a definitive answer to that anywhere on the WWW, some
>> mailing lists and the Xapian documentation.
>> (I only found references to e.g. pdf2text and the like)
>
> Yes. However, you can do this using the provided application Omega, in
> particular the program Omindex. You can find this on the Xapian
> website.
>
> Charlie
>>
>> I was thinking, from reading Xapian's features page, that it can
>> natively index a vast amount of different file types. If I do need to
>> convert everything to text first, that would mean Xapian can - in
>> reality - only work with plain text, which would make it rather
>> useless for my purpose.
>>
>> Thanks in advance for sharing any insights,
>> Florian
>>
>> _______________________________________________
>> Xapian-discuss mailing list
>> Xapian-discuss at lists.xapian.org
>> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>>