[Xapian-devel] Re: [Xapian-commits] 6355: trunk/xapian-applications/omega/ trunk/xapian-applications/omega/docs/

Olly Betts olly at survex.com
Tue Oct 18 07:07:34 BST 2005


On Fri, Jul 29, 2005 at 10:08:13AM +0100, james wrote:
> SVN root:       svn://svn.xapian.org/xapian
> Changes by:     james
> Revision:       6355
> Date:           2005-07-29 10:08:13 +0100 (Fri, 29 Jul 2005)
> 
> Log message (6 lines):
> omindex.cc: add --preserve-nonduplicates / -p option to not delete any
> documents that aren't updated, in replace duplicates mode (so that
> multiple runs of omindex on different subsites don't stomp on each
> other).

This fix seems to be avoiding the real issue, so I feel it's less than
ideal.

Looking at the code, what it's really doing is turning off half of
"skip_duplicates" - the bit at the end of the run where we delete
any documents we've not seen (on the assumption that they've been
deleted from the document tree since the previous index run).

(I notice it still creates and updates the bitmap we use to track
deleted documents, but that's easy enough to fix...)

The half of "skip_duplicates" which it leaves enabled is the code to
replace documents which have the same URL (rather than leaving them
untouched, as "skip_duplicates" does).

The motivation for this option is as described in the log message
above, and this is a genuine problem with my deleted document removal
code.  But if I have multiple subsites, deleted documents should still
get removed from the index, which is why I don't think this is the right
approach.

Arranging to delete the right documents might not be too hard.  All
documents for a particular subsite are indexed by the same H and P term
combination, so we can just check each deletion candidate against those
two postlists (hurrah for skip_to!).  That should be pretty efficient.
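
Roughly like this, say - just a sketch, not tested; it assumes the
deletion candidates come out in ascending docid order, and that
host_term and path_term hold the subsite's H and P terms:

    // One pass over both postlists: keep an iterator per postlist and
    // skip_to() forward through each as we consider candidates.
    Xapian::PostingIterator h = db.postlist_begin(host_term);
    Xapian::PostingIterator p = db.postlist_begin(path_term);
    std::vector<Xapian::docid>::const_iterator i;
    for (i = candidates.begin(); i != candidates.end(); ++i) {
        Xapian::docid did = *i;
        h.skip_to(did);
        if (h == db.postlist_end(host_term) || *h != did) continue;
        p.skip_to(did);
        if (p == db.postlist_end(path_term) || *p != did) continue;
        // Candidate is in this subsite but wasn't seen this run.
        db.delete_document(did);
    }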

The only problem I can see is that if indexroot is specified, we also
need to check each remaining deletion candidate against that, which I
think means we have to look in the document data for each one.  Ick,
that's probably going to be slow.  Or can anyone see a way around this
issue?
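
That lookup would be something like this for each remaining candidate
(again a sketch; root_url is the URL prefix indexroot maps to, and I'm
assuming the document data starts with the "url=" line omindex writes):

    // Fetch the document data just to recover the URL - this is the
    // slow part, as it costs a document fetch per candidate.
    std::string data = db.get_document(did).get_data();
    if (data.compare(0, 4, "url=") == 0) {
        std::string url = data.substr(4, data.find('\n') - 4);
        // Only delete if the URL falls under the indexroot in use.
        if (url.compare(0, root_url.size(), root_url) == 0)
            db.delete_document(did);
    }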

We could just outlaw such partial updates, but that's probably
unreasonable.  Perhaps disabling deletion in that case would do for now.
At least it's a more unusual situation and it doesn't need a special
switch.

The other approach I can see is to move to having a configuration file
which describes what the index should contain.  Then omindex would be
able to process all subsites in one pass, and so the "updated" map would
be correct.  It also has the benefit that removing a whole subsite
works.  However, updates of single subsites, or of sections of
subsites, still look like they'd be awkward, so this doesn't seem to
address the hard part of the problem.
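
For concreteness, I'm imagining something vaguely like this (entirely
hypothetical syntax - no such file exists today):

    # Hypothetical omindex configuration: the whole index in one file.
    database /var/lib/omega/data/default
    subsite http://example.org/docs/  /srv/www/docs
    subsite http://example.org/news/  /srv/www/news

A single omindex run would then index everything listed, so anything
not regenerated by the run really has been deleted.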

Actually, I also wonder if even skip_duplicates should really be
disabling the deletion.  It would be easy and pretty cheap to look up
the document id for each skipped document and flag it as "updated" so it
didn't get deleted...  I think the reason it currently doesn't is just
an oversight on my part.
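
Something along these lines in the skip_duplicates path (sketch;
urlterm is the document's unique URL term as before, and updated is the
map we use at the end of the run to decide what to delete):

    // URL already indexed: rather than just skipping it, record its
    // docid so the end-of-run sweep doesn't delete it.
    Xapian::PostingIterator p = db.postlist_begin(urlterm);
    if (p != db.postlist_end(urlterm)) {
        updated[*p] = true;
        return;  // skip reindexing, as skip_duplicates does now
    }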

Thoughts?  It would be good to sort this out for 0.9.3, which I'm
starting to think about.

Cheers,
    Olly



