[Xapian-devel] Omega changes

Olly Betts olly at survex.com
Fri Dec 17 15:31:47 GMT 2004


On Fri, Dec 17, 2004 at 03:05:13PM +0000, James Aylett wrote:
> On Fri, Dec 17, 2004 at 02:15:34PM +0000, Richard Boulton wrote:
> > Actually, Olly suggested that it might be sensible to remove the
> > duplicates options entirely, and simply default to the behaviour
> > specified above.  Does anyone actually use omindex with a --duplicates
> > option other than "replace"?
> 
> I doubt it very much. They're only there for some measure of backwards
> compatibility in case anyone actually liked the old way of working.
> 
> --duplicates=ignore was designed to save time when you only add
> documents to the corpus. Shouldn't be needed with
> --duplicates=timestamp, and I can't think of a good reason to use
> replace instead of timestamp.

The other function that --duplicates=ignore serves is to allow you to
index documents which aren't all under the same directory.  For example,
you could index "/var/www" -> "/", then loop through users and add in
"/home/$USER/public_html" -> "/~$USER/".  To be honest, I think this
belongs in "omindex.conf".  That way it's much easier to update the
database without slipping up - otherwise you really need a shell script
to remember the sequence of actions for you.
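
As a rough sketch of the "sequence of actions" such a script has to
remember, the Python below builds one omindex command line per
directory -> URL mapping.  Nothing here is an existing omindex feature:
the function name, the database path and the user list are hypothetical,
the "--db" and "--url" options are assumed from omindex's usual
invocation, and passing "--duplicates=ignore" on every pass after the
first is only one reading of the description above.

```python
import shlex

def build_commands(db_path, users):
    """Return one omindex argv list per directory -> URL mapping.

    Illustration only: the mapping mirrors the example in the text
    ("/var/www" -> "/", "/home/$USER/public_html" -> "/~$USER/").
    """
    # (directory on disk, URL prefix it is served under)
    sources = [("/var/www", "/")]
    for user in users:
        sources.append((f"/home/{user}/public_html", f"/~{user}/"))

    commands = []
    for position, (directory, url) in enumerate(sources):
        argv = ["omindex", "--db", db_path, "--url", url]
        if position > 0:
            # Assumption: later passes must not disturb documents
            # indexed by earlier passes.
            argv.append("--duplicates=ignore")
        argv.append(directory)
        commands.append(argv)
    return commands

# Print the commands rather than running them; the path and the
# user names are placeholders.
for argv in build_commands("/tmp/example-db", ["alice", "bob"]):
    print(shlex.join(argv))
```

Keeping the mappings in one list is the point: it is the sort of
information that could equally live in a configuration file instead of
a script.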

If that's addressed, the only real reason to keep anything like
"--duplicates=ignore" is that it might be faster in the case where you
know you've not modified any documents, only added (or removed) some.
Then you can save a call to stat and a read of the document data (or a
value) for each unchanged document.  It might be worth profiling
that before removing this entirely.  If it is significant, then I'd
suggest we provide an option to force omindex to assume an existing
document won't be modified - but it should probably still delete it
from the index if it isn't on disk (which is different to
"--duplicates=ignore").
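
To make the difference concrete, here is a minimal Python sketch of the
decision logic such an option might follow.  It is not omindex code:
the function, its parameters and the action names are all hypothetical,
"get_mtime" stands in for the stat call, and it assumes the set of
files on disk comes from a directory walk that needs no per-file stat.

```python
def plan_update(indexed, on_disk, get_mtime, assume_unmodified=False):
    """Decide what to do with each document.

    indexed           -- dict: path -> modification time stored in the index
    on_disk           -- set of paths found by walking the directory tree
    get_mtime         -- callable(path) -> modification time (stands in for stat)
    assume_unmodified -- if True, trust that existing documents are unchanged
    """
    actions = {}
    for path in sorted(on_disk):
        if path not in indexed:
            actions[path] = "add"
        elif assume_unmodified:
            # The saving: no stat and no read from the index.
            actions[path] = "skip"
        elif get_mtime(path) > indexed[path]:
            actions[path] = "replace"
        else:
            actions[path] = "skip"
    # Unlike "--duplicates=ignore", documents which have vanished
    # from disk are still removed from the index.
    for path in indexed:
        if path not in on_disk:
            actions[path] = "delete"
    return actions
```

With "assume_unmodified" set, "get_mtime" is never called for a
document already in the index, which is the cost that profiling would
need to show is significant.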

> However: how would you cope with this with two databases with
> different indexing options? Specifically, is there anything sane we
> can do with different stemmers in use?

Not if you try to search over them.  At least not without serious work
on the internals of Xapian itself, or some funky merging of results
from separate searches.

I don't think it's actually terribly useful to search databases in
different languages.  Only proper names are going to usefully feature
across them, and even those often vary (e.g. "London" vs "Londres",
"Rome" vs "Roma").

Cheers,
    Olly



