[Xapian-devel] Proposed changes to omindex
James Aylett
james-xapian at tartarus.org
Sun Aug 27 15:27:08 BST 2006
On Sat, Aug 26, 2006 at 11:13:37PM +0100, Olly Betts wrote:
> Some of the format conversion filters want a filename for the input, so
> you can't open the file once and dup the file descriptor (pdftotext for
> example). Those that can read from stdin (e.g. antiword) could be
> handled this way if it actually helps.
Well, strictly speaking we can LD_PRELOAD filters that can't act as
stream filters to death, although that only works on modern Unices. We
shouldn't really rely on that, though :-)
Most filter maintainers would probably accept a patch to read from
stdin if the filter doesn't already, and such a patch wouldn't be too
difficult to write. If we run into common filters that lack this,
patching them would benefit everyone.
I've no idea whether it will actually help in practice. I suspect
that in most cases it's not going to win you much, because the file
buffering will already do the right thing.
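
For filters that do read stdin, the open-once approach discussed above
could look roughly like the sketch below. It is only an illustration:
run_filter_on_fd and STAND_IN_FILTER are hypothetical names, and the
stand-in filter is a one-line Python command that upper-cases its input,
used here in place of a real converter such as antiword.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for a stdin-reading filter (assumption: a real
# converter would go here). It upper-cases whatever arrives on stdin.
STAND_IN_FILTER = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]


def run_filter_on_fd(fd, argv):
    """Run `argv` with the already-open descriptor `fd` as its stdin."""
    # Rewind first: the child shares the file offset with us, so an
    # earlier reader may have left it at end of file.
    os.lseek(fd, 0, os.SEEK_SET)
    done = subprocess.run(argv, stdin=fd, stdout=subprocess.PIPE, check=True)
    return done.stdout
```

The caller opens the document once with os.open(path, os.O_RDONLY) and
can then pass the same descriptor to several filters in turn. Filters
that insist on a filename (pdftotext, in the example quoted above) are
not helped by this.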
> > One idea I've talked to someone about is separating omindex into
> > something that drives scriptindex, which in theory would allow you to
> > use the file spider in omindex with whatever indexing strategy you
> > wanted.
>
> Perhaps that was me, or possibly we've both discussed it with Richard
> separately?
I've no idea :-)
> Anyway, it's an interesting idea, though it might add measurable
> overhead. A step towards it is that I've recently added a "load"
> command to scriptindex which allows you to write an index script which
> takes a filename to read and index the contents of.
If we retain omindex's approach for HTML (which it understands
natively) and anything that filters to plain text, and just allow
people to write filters that generate scriptindex input files (with
the filter being associated with an index script), then we get more
flexibility in omindex without having to sacrifice efficiency of
indexing in the common case.
That would also allow decent indexing of anything that embeds XMP,
incidentally. This is considered A Good Thing, at least by me.
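
As a rough illustration of a filter that generates scriptindex input,
the sketch below writes one record. It assumes the scriptindex input
format is "field=value" lines, with continuation lines starting with
"=" and a blank line ending each record; check that against the
scriptindex documentation before relying on it. The function name
format_record and the field names are hypothetical, and the field names
would have to match whatever the associated index script expects.

```python
def format_record(fields):
    """Render one record as 'name=value' lines plus a closing blank line.

    `fields` is a list of (name, value) pairs. Newlines inside a value
    are written as continuation lines that start with '=' (assumed
    scriptindex convention).
    """
    lines = []
    for name, value in fields:
        first, *rest = str(value).split("\n")
        lines.append(f"{name}={first}")
        lines.extend(f"={more}" for more in rest)
    # The trailing blank line separates this record from the next one.
    return "\n".join(lines) + "\n\n"


# Hypothetical metadata pulled out of a document by some filter.
record = format_record([
    ("url", "/photos/beach.jpg"),
    ("title", "Beach"),
    ("description", "Sunset over the bay.\nTaken with a tripod."),
])
```

A filter built this way would write one such record per document to a
file, and that file would then be handed to scriptindex together with
the index script associated with the filter.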
> > I'd certainly favour having a way of running the query parser that
> > didn't need R-terms, [...]
>
> There already is: QueryParser::set_stemming_strategy() can be called
> with STEM_NONE or STEM_ALL (the default is STEM_SOME).
Ah, excellent. Is this documented anywhere? Can't remember seeing it...
James
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
james at tartarus.org uncertaintydivision.org