[Xapian-discuss] Indexing PDF, DOC etc.

Wed Nov 5 10:52:34 GMT 2008

Florian Beer wrote:
> Hello dear list,
> 
> I'm trying to index various types of files with Xapian, used in a  
> Python program.
> Text and HTML work fine via index_text() but I can't find any  
> explanations for indexing other types of files.
> 
> Is it the case that _everything_ has to be converted to text prior to
> indexing it?
> I couldn't find a definitive answer to that on the web, in mailing
> list archives, or in the Xapian documentation.
> (I only found references to e.g. pdf2text and the like)

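Yes, Xapian itself indexes text, so other formats have to be converted
to text first. However, you don't have to do the conversion yourself:
Omega, the application provided alongside Xapian, includes the omindex
program, which handles this for you. You can find it on the Xapian
website.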
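
If you would rather stay in your own Python program, the same idea can
be sketched roughly as below: pick an external converter by file
extension, capture its text output, and pass that to index_text(). This
is only a minimal sketch, not how omindex is implemented. It assumes
the pdftotext and antiword command-line tools are installed; the
CONVERTERS table and the function names (converter_command,
extract_text, index_file) are made up for illustration.

```python
import os
import subprocess

# Assumption: these external tools are installed and print the extracted
# text on stdout. The table itself is hypothetical; extend it as needed.
CONVERTERS = {
    ".pdf": lambda path: ["pdftotext", path, "-"],
    ".doc": lambda path: ["antiword", path],
}


def converter_command(path):
    """Return the command that extracts text from path, or None if the
    file has no registered converter and is treated as plain text."""
    ext = os.path.splitext(path)[1].lower()
    make_command = CONVERTERS.get(ext)
    return make_command(path) if make_command else None


def extract_text(path):
    """Return the text content of path, running a converter if one is needed."""
    command = converter_command(path)
    if command is None:
        # No converter registered: read the file directly as text.
        with open(path, encoding="utf-8", errors="replace") as handle:
            return handle.read()
    result = subprocess.run(command, check=True, capture_output=True)
    return result.stdout.decode("utf-8", errors="replace")


def index_file(db_path, path):
    """Extract text from path and add it to the Xapian database at db_path."""
    # Imported here so the helpers above can be used without the bindings.
    import xapian

    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    doc = xapian.Document()
    doc.set_data(path)  # store the path so search results can point back to it

    indexer = xapian.TermGenerator()
    indexer.set_stemmer(xapian.Stem("en"))
    indexer.set_document(doc)
    indexer.index_text(extract_text(path))

    db.add_document(doc)
    db.commit()
```

Calling index_file("/path/to/db", "report.pdf") would then index the
PDF's text just as index_text() does for your text and HTML files.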

Charlie
> 
> I was thinking, from reading Xapian's features page, that it can
> natively index a vast amount of different file types. If I do need to  
> convert everything to text first, that would mean Xapian can - in  
> reality - only work with plain text, which would make it rather  
> useless for my purpose.
> 
> Thanks in advance for sharing any insights,
> Florian