[Xapian-devel] Proposed changes to omindex

Sat Aug 26 23:13:37 BST 2006

On Sat, Aug 19, 2006 at 07:22:10PM +0100, James Aylett wrote:
> (Although any decent network fs has built-in caching, and in any
> case you could rely on the OS buffers - if you open() first, then dup
> the filedes, then use fdopen() to turn it into a FILE* - twice -
> there's very little reason you'll have to hit the network twice, even
> on a lame net fs.

Some of the format conversion filters (pdftotext, for example) want a
filename for the input, so for those you can't open the file once and
dup the file descriptor.  Those that can read from stdin (e.g. antiword)
could be handled this way if it actually helps.
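
For reference, here's a rough sketch of the open()/dup()/fdopen()
approach James describes.  It isn't omindex code, and the function names
read_via_dup() and read_twice() are made up for illustration.  One thing
to watch: descriptors created by dup() share a file offset with the
original, so the sketch seeks back to the start before each pass.

```cpp
#include <cstdio>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: read the whole file through a stdio stream built
// on a duplicate of fd, leaving fd itself open for later passes.
static bool read_via_dup(int fd, std::string &out) {
    // dup()ed descriptors share one file offset, so rewind first.
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1) return false;
    int copy = dup(fd);
    if (copy < 0) return false;
    FILE *fp = fdopen(copy, "rb");
    if (!fp) { close(copy); return false; }
    out.clear();
    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) out.append(buf, n);
    bool ok = !ferror(fp);
    fclose(fp);  // closes the duplicate only, not fd
    return ok;
}

// Hypothetical driver: open the path once, then read it twice through
// two separate FILE* streams.
bool read_twice(const char *path, std::string &first, std::string &second) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    bool ok = read_via_dup(fd, first) && read_via_dup(fd, second);
    close(fd);
    return ok;
}
```

This only covers the case where we read the data ourselves (or hand the
descriptor to a filter as its stdin); it does nothing for filters which
insist on being given a filename.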

> One idea I've talked to someone about is separating omindex into
> something that drives scriptindex, which in theory would allow you to
> use the file spider in omindex with whatever indexing strategy you
> wanted.

Perhaps that was me, or possibly we've both discussed it with Richard
separately?

Anyway, it's an interesting idea, though it might add measurable
overhead.  As a step towards it, I've recently added a "load" command
to scriptindex, which lets you write an index script that takes a
filename, reads that file, and indexes its contents.

> I'd certainly favour having a way of running the query parser that
> didn't need R-terms, [...]

There already is: QueryParser::set_stemming_strategy() can be called
with STEM_NONE or STEM_ALL (the default is STEM_SOME).

Cheers,
    Olly