[Xapian-devel] Omega changes

Fri Dec 17 16:47:14 GMT 2004

On Fri, Dec 17, 2004 at 03:55:03PM +0000, James Aylett wrote:
> If you get duplicates while doing that, you want to overwrite, [...]
> Is there something else I'm missing?

The issue isn't duplicates - it's that you don't want to REMOVE all the
files from "/" when you index "/~olly/"...

> Yeek. I've just looked at the deletion code, and I'm not convinced it
> works. It seems to delete everything it doesn't update, rather than
> everything it doesn't find: in index_file() if dupes==DUPE_ignore and
> it's a dupe, we just return, rather than setting updated[doc_id] (of
> course, we don't get doc_id in the test, for speed, so that would be a
> pain to do).

Hmm, the mechanism is only meant to be used in the DUPE_replace case,
but the code to remove the documents fires whatever dupes is set to.
That's a bug.  It also rather suggests nobody uses anything but
the default of DUPE_replace!

But your point may mean that an "assume still existing documents aren't
modified" flag wouldn't be a win, since we'd want the document id in
this case anyway.  Or that it should do exactly what DUPE_ignore is
meant to, and only add unindexed documents.
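
To make the bug concrete, here is a minimal Python sketch of the
intended behaviour. It is not Omega's actual code: the dict standing in
for the database and the reindex() function are hypothetical, and only
the names DUPE_ignore, DUPE_replace and "updated" come from this
thread. The point is the guard around the removal pass.

```python
DUPE_IGNORE = "ignore"
DUPE_REPLACE = "replace"


def reindex(db, files_on_disk, dupes):
    """Index files_on_disk into db (a dict of url -> content).

    db is a hypothetical stand-in for the real index. Returns the set
    of urls that were removed from db.
    """
    updated = set()
    for url, content in files_on_disk.items():
        if url in db and dupes == DUPE_IGNORE:
            # Duplicate in ignore mode: skip it. Note that it is NOT
            # recorded in `updated`, which is why the removal pass
            # below must not run in this mode.
            continue
        db[url] = content
        updated.add(url)

    removed = set()
    # The guard: only the replace mode tracks every document it saw,
    # so only the replace mode may delete what it did not update.
    if dupes == DUPE_REPLACE:
        for url in list(db):
            if url not in updated:
                del db[url]
                removed.add(url)
    return removed
```

With the guard in place, a run in ignore mode adds new files and leaves
everything else alone, while a run in replace mode still drops
documents whose files have gone away. Without the guard, ignore mode
would delete every document it had skipped.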

> (We're relying on document IDs never being reused within a single
> database, which I know is okay, but I can't find it documented
> anywhere as being true, which it probably should be. Maybe I'm looking
> in the wrong place - I'd expect a note in the WritableDatabase docs.)

It is true, but quite possibly un- (or under-) documented.

> If we want to remove --duplicates=ignore, we'll have to profile that
> against different OS/FS combinations. While --duplicates=ignore will
> certainly be faster, I can imagine quite a variance of stat/compare on
> different systems. Probably better to leave it in somehow, probably by
> replacing --duplicates with --ignore-duplicates.

Actually, I suspect the stat time will be swamped by the time to find
the modtime from Xapian.  Although perhaps not if lots is cached.

> > > is there anything sane we can do with different stemmers in use?
> > 
> > Not if you try to search over them.  At least not without serious work
> > to the insides of Xapian itself, or some funky merging of results
> > from separate searches.
> 
> You could do it in the query, in simple cases, couldn't you? Or would
> it mess up ranks in the MSet?

So replace TERM1 with:

(en_stem(TERM1) OR fr_stem(TERM1) OR de_stem(TERM1))

The problem there is that stems may collide, and a French word could
match a totally unrelated English word which happens to have the same
stem.  I don't know if this is a big problem in reality.

You could add a boolean term specifying the language to every single
document in each database.  Then have N forms of the query, each with
different stemming, and FILTER each with the relevant language term.
Then XOR all those together (XOR should be more efficient than an OR,
since Xapian will know the maximum weight is lower; each document has
only one language term, so at most one of the N forms can match it and
the results are the same).
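
A rough Python sketch of the two approaches, to show the collision and
how the language filter avoids it. Everything here is a toy: the
"stemmers" strip one made-up suffix per language, the index is a plain
dict, the "L" prefix for the language term is just this sketch's
convention, and only match sets are modelled, not weights. It does not
use the Xapian API.

```python
# Toy stand-ins for real stemmers (hypothetical rules, one suffix each).
SUFFIX = {"en": "ing", "fr": "s", "de": "en"}


def toy_stem(lang, word):
    word = word.lower()
    suffix = SUFFIX[lang]
    if word.endswith(suffix) and len(word) > len(suffix):
        return word[: -len(suffix)]
    return word


def make_doc(lang, text):
    # Index each word stemmed in the document's own language, plus one
    # boolean term naming that language.
    return {toy_stem(lang, w) for w in text.split()} | {"L" + lang}


INDEX = {
    1: make_doc("en", "chat with support"),
    2: make_doc("fr", "les chats noirs"),
}


def docs_with(term):
    return {doc_id for doc_id, terms in INDEX.items() if term in terms}


def or_of_stems(word):
    # (en_stem(TERM1) OR fr_stem(TERM1) OR de_stem(TERM1))
    matches = set()
    for lang in SUFFIX:
        matches |= docs_with(toy_stem(lang, word))
    return matches


def filtered_xor(word):
    # One branch per language: stemmed term FILTERed by the language
    # term, with the branches XORed together.
    matches = set()
    for lang in SUFFIX:
        branch = docs_with(toy_stem(lang, word)) & docs_with("L" + lang)
        matches ^= branch
    return matches
```

Here or_of_stems("chats") returns both documents, because the French
stem "chat" collides with the English word "chat" in document 1, while
filtered_xor("chats") returns only the French document 2. Since the
branches are disjoint, the XOR gives the same set an OR would.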

I remember a lot of fussing about this at Muscat, but I never really
heard any evidence that it was useful - it just seemed that "Muscat
Europe" thought they needed it to make sales.

> But users won't bother to tell you which language they're searching
> in. I suppose you could try to autodetect that, but it won't work well
> with typical short queries, I suspect.

I think really you need to make them bother, rather than trying to
cobble together some Heath Robinson scheme which slows down every
search to allow them not to.

Cheers,
    Olly