[Xapian-discuss] Indexing PDF, DOC etc.

Olly Betts olly at survex.com
Sun Nov 9 14:50:12 GMT 2008


I guess this has mostly been covered already, but I think it's worth
explicitly addressing the high-level "WHY"...

On Wed, Nov 05, 2008 at 11:44:00AM +0100, Florian Beer wrote:
> I was thinking, from reading Xapian's features page, that it can
> natively index a vast amount of different file types.

The Xapian API doesn't natively support extracting text from any
filetypes.  There are already good quality open source converters for
most common formats, so it's not a productive use of our time to
duplicate that work.

In general, we resist adding features that aren't "search", particularly
if they can already be done using other existing projects.  This keeps
down the amount of code we need to write and maintain, and allows us to
focus our efforts on making Xapian as good as we can at what it does do.

Sometimes there's a good argument for including something.  For example,
Unicode support is required inside the QueryParser and TermGenerator
classes, so we make it available via the API.  We have to maintain that
code anyway, and you really want to be using consistent character
classifications etc., since these can change slightly between Unicode
versions.

> If I do need to convert everything to text first, that would mean
> Xapian can - in reality - only work with plain text, which would make
> it rather useless for my purpose.

To index text from non-plaintext formats, just use a conversion library
or utility to extract text, and Xapian for the indexing/searching part.
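
As a rough illustration of that split, here is a minimal sketch using
the Python bindings.  It assumes the external utilities pdftotext and
antiword are installed and on the PATH; the CONVERTERS table and the
helper names converter_command, extract_text and index_file are made up
for this example and are not part of Xapian.

```python
import os
import subprocess

# Hypothetical table: file extension -> command that writes plain text
# to stdout.  pdftotext and antiword are assumed to be installed.
CONVERTERS = {
    ".pdf": lambda path: ["pdftotext", path, "-"],
    ".doc": lambda path: ["antiword", path],
}


def converter_command(path):
    """Return the external command used to extract text from path."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return CONVERTERS[ext](path)
    except KeyError:
        raise ValueError("no converter configured for %r" % ext)


def extract_text(path):
    """Run the converter and return its output as a string."""
    result = subprocess.run(
        converter_command(path), check=True, stdout=subprocess.PIPE
    )
    # The output encoding depends on the converter and its settings;
    # UTF-8 with replacement is an assumption made for this sketch.
    return result.stdout.decode("utf-8", errors="replace")


def index_file(db, path):
    """Extract text from path and add it to a xapian.WritableDatabase."""
    import xapian  # imported here so the helpers above work without it

    doc = xapian.Document()
    doc.set_data(path)  # store the path so results can point back to it

    termgen = xapian.TermGenerator()
    termgen.set_stemmer(xapian.Stem("en"))
    termgen.set_document(doc)
    termgen.index_text(extract_text(path))

    db.add_document(doc)
```

To use it, open a database with
xapian.WritableDatabase("docs.db", xapian.DB_CREATE_OR_OPEN), call
index_file(db, "report.pdf") for each file, and then db.commit().
Supporting another format is then only a matter of adding a converter
to the table; the Xapian side doesn't change.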

Cheers,
    Olly


