[Xapian-devel] Omega changes

Fri Dec 17 15:55:03 GMT 2004

On Fri, Dec 17, 2004 at 03:31:47PM +0000, Olly Betts wrote:

> The other function that --duplicates=ignore serves is to allow you to
> index documents which aren't all under the same directory.  For example,
> you could index "/var/www" -> "/", then loop through users and add in
> "/home/$USER/public_html" -> "/~$USER/".

If you get duplicates while doing that, you want to overwrite, not
ignore, because (in a normal Apache config, for instance) anything
under /var/www/~james/ will be ignored in favour of
/home/james/public_html (or whatever). So what you want there is the
reindex. Although what you really want is not to put shadowed
directories in your DocumentRoot, because that's confusing and
somewhat crazy :-)
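
A rough sketch of why the policy matters here, with a plain map
standing in for the index. UrlIndex, OnDuplicate and add_tree are
made-up names for illustration, not omindex code, and it assumes
/var/www is indexed first and the home directories afterwards:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the index: URL -> path of the file served there.
using UrlIndex = std::map<std::string, std::string>;

enum class OnDuplicate { Ignore, Overwrite };

// Add one directory tree's (URL, path) pairs to the index.
void add_tree(UrlIndex& index,
              const std::vector<std::pair<std::string, std::string>>& files,
              OnDuplicate policy) {
    for (const auto& f : files) {
        assert(!f.first.empty());  // every entry needs a URL
        if (policy == OnDuplicate::Overwrite)
            index[f.first] = f.second;        // later tree wins
        else
            index.emplace(f.first, f.second); // no-op if the URL is already there
    }
}
```

With Ignore on the second pass, "/~james/index.html" stays pointing at
the copy under /var/www, which is the one Apache would never serve.
With Overwrite it ends up pointing at the public_html copy.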

Is there something else I'm missing?

> To be honest, I think this belongs in "omindex.conf".  That way it's
> much easier to update the database without slipping up - otherwise
> you really need a shell script to remember the sequence of actions
> for you.

Yeah, providing we can reindex bits of sites without needing
configuration files. I'm not a big fan of configuration files that
only have two or three options in them, to be honest. And if we have
to have them, let's provide a way of passing config directives on the
command line.

> If that's addressed, the only real reason to keep anything like
> "--duplicates=ignore" is it might be faster in the case where you know
> you've not modified any documents, only added (or removed) some.  Then
> you can save a call to stat and reading the document data (or a
> value) for each unchanged document.  It might be worth profiling
> that before removing this entirely.  If it is significant, then I'd
> suggest we provide an option to force omindex to assume an existing
> document won't be modified - but it should probably still delete it
> from the index if it isn't on disk (which is different to
> "--duplicates=ignore").

Yeek. I've just looked at the deletion code, and I'm not convinced it
works. It seems to delete everything it doesn't update, rather than
everything it doesn't find: in index_file(), if dupes==DUPE_ignore and
it's a dupe, we just return rather than setting updated[doc_id]. (Of
course, we don't fetch doc_id when testing for a dupe, for speed, so
that would be a pain to do.)
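
To show the difference between "not updated" and "not found", here is
a simplified model, not the real omindex code. SweepState,
visit_existing and to_delete are hypothetical names, doc ids are
0-based for convenience, and it assumes the doc id is available at the
point of the dupe test (which, as above, it currently isn't):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins, not omindex's real types.
enum class DupePolicy { Ignore, Replace };

struct SweepState {
    std::vector<bool> updated;  // set when a document was re-indexed
    std::vector<bool> seen;     // set when its file was found on disk
    explicit SweepState(std::size_t n) : updated(n, false), seen(n, false) {}
};

// Called once for each file on disk that already has a document in the index.
void visit_existing(SweepState& s, std::size_t doc_id, DupePolicy policy) {
    assert(doc_id < s.seen.size());
    s.seen[doc_id] = true;
    if (policy == DupePolicy::Ignore)
        return;  // early return: the document is never marked as updated
    s.updated[doc_id] = true;
}

// Doc ids a sweep would delete if it trusted the given flags.
std::vector<std::size_t> to_delete(const std::vector<bool>& keep) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < keep.size(); ++i)
        if (!keep[i]) out.push_back(i);
    return out;
}
```

If documents 0 and 1 are still on disk and 2 has gone, then under
Ignore a sweep over `updated` deletes all three, while a sweep over
`seen` deletes only document 2.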

(We're relying on document IDs never being reused within a single
database, which I know is okay, but I can't find it documented
anywhere as being true, which it probably should be. Maybe I'm looking
in the wrong place - I'd expect a note in the WritableDatabase docs.)

If we want to remove --duplicates=ignore, we'll have to profile that
against different OS/FS combinations. While --duplicates=ignore will
certainly be faster, I can imagine the cost of the stat/compare varying
quite a lot between systems. Probably better to leave it in somehow,
perhaps by replacing --duplicates with --ignore-duplicates.

> > However: how would you cope with this with two databases with
> > different indexing options? Specifically, is there anything sane we
> > can do with different stemmers in use?
> 
> Not if you try to search over them.  At least not without serious work
> to the insides of xapian itself, or some funky merging of results
> from separate searches.

You could do it in the query, in simple cases, couldn't you? Or would
it mess up ranks in the MSet?

> I don't think it's actually terribly useful to search databases in
> different languages.  Only proper names are going to usefully feature
> across them, and even those often vary (e.g. "London" vs "Londres",
> "Rome" vs "Roma").

But users won't bother to tell you which language they're searching
in. I suppose you could try to autodetect that, but it won't work well
with typical short queries, I suspect.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org