[Xapian-discuss] Indexing PDF, DOC etc.

Charlie Hull charlie at juggler.net
Thu Nov 6 11:49:46 GMT 2008


Florian Beer wrote:
> I'm trying to index PDFs that are stored in a MySQL database (blob
> field) using omindex now.
> What's the exact call to tell omindex to index a byte stream (passed
> directly from my Python program) instead of specifying a directory on
> the command line?

I'm not sure you can do this.
> 
> Is this even possible, or would I have to first write the PDF data out  
> from the MySQL to a temporary file, let it index (supplying arbitrary  
> metadata) and then delete the temp file?

I think your best bet would be to examine how omindex handles PDFs 
(AFAIK it uses pdfinfo to extract metadata and pdftotext to extract the 
text) and use the same method in your Python program. I suspect you may 
need to write a temporary file as you suggest.
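
A minimal sketch of that temporary-file approach, assuming Python 3, the
pdftotext command-line tool on the PATH and the Xapian Python bindings.
The helper names extract_text and index_blob are hypothetical, and this
is not a copy of what omindex does internally:

```python
import os
import subprocess
import tempfile


def extract_text(blob, command=("pdftotext", "-enc", "UTF-8"), suffix=".pdf"):
    """Write blob to a temporary file, run command on it, return its stdout.

    Assumption: the command accepts "<input file> -" and writes the
    extracted text to stdout, as pdftotext does.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # Close the file before the external tool opens it.
        with os.fdopen(fd, "wb") as handle:
            handle.write(blob)
        result = subprocess.run(
            list(command) + [path, "-"],
            stdout=subprocess.PIPE,
            check=True,
        )
        return result.stdout.decode("utf-8", errors="replace")
    finally:
        # Always delete the temporary file, even if extraction fails.
        os.remove(path)


def index_blob(db_path, blob, title):
    """Index one PDF blob (e.g. read from a MySQL BLOB column) into Xapian."""
    import xapian  # imported here so extract_text works without the bindings

    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    indexer = xapian.TermGenerator()
    indexer.set_stemmer(xapian.Stem("english"))

    doc = xapian.Document()
    doc.set_data(title)  # arbitrary metadata shown with search results
    indexer.set_document(doc)
    indexer.index_text(extract_text(blob))

    db.add_document(doc)
    db.commit()
```

The command parameter is only there so that a different extractor can be
swapped in for other file types; the blob itself never has to live in a
directory that omindex crawls.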

You haven't specified your operating system. On Windows you also have 
the option of IFilters, which is what Flax (www.flax.co.uk) uses. Flax is 
written in Python, which might be helpful. You should also look at Xappy, 
a high-level Python interface to Xapian, which is linked from the same page.

Charlie

> 
> 
> On Nov 5, 2008, at 11:52 , Charlie Hull wrote:
> 
>> Florian Beer wrote:
>>> Hello dear list,
>>>
>>> I'm trying to index various types of files with Xapian, used in a
>>> Python program.
>>> Text and HTML work fine via index_text() but I can't find any
>>> explanations for indexing other types of files.
>>>
>>> Is it the case that _everything_ has to be converted to text prior to
>>> indexing it?
>>> I didn't find a definitive answer to that anywhere on the WWW, some
>>> mailing lists and the Xapian documentation.
>>> (I only found references to e.g. pdf2text and the like)
>> Yes. However, you can do this using the provided application Omega, in
>> particular the program Omindex. You can find this on the Xapian website.
>>
>> Charlie
>>> I was thinking, from reading Xapian's features page, that it can
>>> natively index a vast amount of different file types. If I do need to
>>> convert everything to text first, that would mean Xapian can - in
>>> reality - only work with plain text, which would make it rather
>>> useless for my purpose.
>>>
>>> Thanks in advance for sharing any insights,
>>> Florian
>>>
>>> _______________________________________________
>>> Xapian-discuss mailing list
>>> Xapian-discuss at lists.xapian.org
>>> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>>>