[Xapian-discuss] Indexing PDF, DOC etc.

Wed Nov 5 10:44:00 GMT 2008

Hello dear list,

I'm trying to index various types of files with Xapian, which I use
from a Python program.
Text and HTML work fine via index_text(), but I can't find any
explanation of how to index other types of files.

Is it the case that _everything_ has to be converted to text prior to
indexing it?
I couldn't find a definitive answer to that anywhere: not on the WWW,
not on the mailing lists I searched, and not in the Xapian documentation.
(I only found references to e.g. pdf2text and the like)

From reading Xapian's features page, I had assumed that it can
natively index a wide range of file types. If I do need to
convert everything to text first, that would mean Xapian can, in
reality, only work with plain text, which would make it rather
useless for my purpose.

Thanks in advance for sharing any insights,
Florian