[Xapian-discuss] Indexing PDF, DOC etc.
Florian Beer
florian.beer at dark-green.com
Wed Nov 5 10:44:00 GMT 2008
Hello dear list,
I'm trying to index various types of files with Xapian, used in a
Python program.
Text and HTML work fine via index_text() but I can't find any
explanations for indexing other types of files.
Is it the case that _everything_ has to be converted to text prior to
indexing it?
I couldn't find a definitive answer to that anywhere on the web, on
mailing lists, or in the Xapian documentation.
(I only found references to e.g. pdf2text and the like)
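For reference, the convert-first approach those references hint at would
look roughly like the sketch below. It assumes the standard xapian Python
bindings plus the external pdftotext and antiword tools (my choice of
converters, installed separately, not something Xapian ships). The names
CONVERTERS, to_text and index_file are made up for illustration.

```python
import subprocess
from pathlib import Path

# Assumed external converters, keyed by file extension. Each command must
# write plain text to stdout; pdftotext and antiword are separate tools
# that have to be installed, they are not part of Xapian.
CONVERTERS = {
    ".pdf": lambda path: ["pdftotext", "-enc", "UTF-8", str(path), "-"],
    ".doc": lambda path: ["antiword", str(path)],
}


def to_text(path, converters=CONVERTERS):
    """Return the plain-text content of a file, converting it if needed."""
    path = Path(path)
    make_command = converters.get(path.suffix.lower())
    if make_command is None:
        # No converter registered: treat the file as plain text already.
        return path.read_text(encoding="utf-8", errors="replace")
    result = subprocess.run(make_command(path), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def index_file(db, path):
    """Convert one file to text and add it to a xapian.WritableDatabase."""
    import xapian  # imported here so to_text() is usable without the bindings

    doc = xapian.Document()
    doc.set_data(str(path))  # keep the path so search results can show it

    generator = xapian.TermGenerator()
    generator.set_stemmer(xapian.Stem("english"))
    generator.set_document(doc)
    generator.index_text(to_text(path))

    db.add_document(doc)
```

With a database opened as xapian.WritableDatabase("index.db",
xapian.DB_CREATE_OR_OPEN), calling index_file(db, "report.pdf") would index
the extracted text in the same way as a plain text file. The database path
and file name here are placeholders.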
From reading Xapian's features page, I had assumed that it can
natively index a wide range of file types. If I do need to
convert everything to text first, that would mean Xapian can, in
reality, only work with plain text, which would make it rather
useless for my purpose.
Thanks in advance for sharing any insights,
Florian