[Xapian-devel] Re: [Xapian-commits] 6355: trunk/xapian-applications/omega/ trunk/xapian-applications/omega/docs/

Tue Oct 18 11:47:48 BST 2005

On Tue, Oct 18, 2005 at 07:07:34AM +0100, Olly Betts wrote:

> > omindex.cc: add --preserve-nonduplicates / -p option to not delete any
> > documents that aren't updated, in replace duplicates mode (so that
> > multiple runs of omindex on different subsites don't stomp on each
> > other).
> 
> This fix seems to be avoiding the real issue, so it's less than ideal I
> feel.

I think the real issue is that omindex is trying to model two
different ways of looking at the world, one simple (without subsites)
and one more complex but very, very specific (with subsites). It tends
to bite new users, and it requires quite different bits of code to
handle the different options - yet it's all smushed together in the
hope that it'll all be okay, basically because of people gradually
adding features (first me, to support something I needed, then you for
something else). Currently omindex embodies TIMTOWTDI, which probably
isn't ideal for an out-of-the-box basic search system.

> Looking at the code, what it's really doing is turning off half of
> "skip_duplicates" - the bit at the end of the run where we delete
> any documents we've not seen (on the assumption that they've been
> deleted from the document tree since the previous index run).
> 
> (Although I notice it still creates and updates the bitmap we use to
> track deleted documents, but that's easy enough to fix...)

I just needed a quick fix for someone :-)
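
For anyone following along without the code to hand, here is a rough
model of the end-of-run step being discussed. It is only a sketch: the
function name and the vector-of-bool "updated" map are made up for
illustration and are not the actual omindex.cc code. The flag simply
stands in for the new --preserve-nonduplicates / -p option.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using DocId = std::uint32_t;

// Hypothetical model of the "updated" map: updated[id] is true if
// document `id` was seen during this indexing run. Index 0 is unused,
// since document ids start at 1.
std::vector<DocId> documents_to_delete(const std::vector<bool>& updated,
                                       bool preserve_nonduplicates) {
    std::vector<DocId> stale;
    // With the preserve option set, nothing unseen is deleted, so runs
    // over different subsites do not remove each other's documents.
    if (preserve_nonduplicates) return stale;
    for (std::size_t id = 1; id < updated.size(); ++id) {
        // Anything not seen this run is assumed to have gone from the
        // document tree since the previous run.
        if (!updated[id]) stale.push_back(static_cast<DocId>(id));
    }
    return stale;
}
```

Note that in this model the map is only consulted when deletion is
enabled, which is the point made above about not needing to maintain
it at all in the preserve case.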

[The new code will not delete old documents in subsites]
> Arranging to delete the right documents might not be too hard.  All
> documents for a particular subsite are indexed by the same H and P term
> combination so we can just check each deletion candidate against those
> two postlists (hurrah for skip_to!)  That should be pretty efficient.

Yeah, I was just lazy.
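
To make the skip_to idea concrete, here is a toy version of the check.
It does not use the real Xapian API: PostList below is a hypothetical
stand-in built on a sorted vector, and deletable_in_subsite is an
invented name. It assumes the deletion candidates are in ascending
docid order, so each postlist is only ever walked forwards.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using DocId = std::uint32_t;

// Toy posting list: sorted document ids plus a forward-only cursor.
class PostList {
  public:
    explicit PostList(std::vector<DocId> ids) : ids_(std::move(ids)) {}

    // Move the cursor to the first entry >= target (never backwards)
    // and report whether that entry is exactly target.
    bool skip_to(DocId target) {
        auto it = std::lower_bound(ids_.begin() + pos_, ids_.end(), target);
        pos_ = static_cast<std::size_t>(it - ids_.begin());
        return pos_ < ids_.size() && ids_[pos_] == target;
    }

  private:
    std::vector<DocId> ids_;
    std::size_t pos_ = 0;
};

// Keep only the candidates indexed by both the H (host) term and the
// P (path) term of the subsite being updated.
std::vector<DocId> deletable_in_subsite(const std::vector<DocId>& candidates,
                                        PostList host, PostList path) {
    assert(std::is_sorted(candidates.begin(), candidates.end()));
    std::vector<DocId> out;
    for (DocId id : candidates) {
        if (host.skip_to(id) && path.skip_to(id)) out.push_back(id);
    }
    return out;
}
```

Documents that fail either check belong to some other subsite and are
left alone. This does nothing for the indexroot case raised below.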

> The only problem I can see is that if indexroot is specified, we also
> need to check each remaining deletion candidate against that, which I
> think means we have to look in the document data for each one.  Ick,
> that's probably going to be slow.  Or can anyone see a way around
> this issue?

Changing the way omindex works to drop indexroot and make it all a lot
more obvious? This is a serious suggestion, by the way - I'm pretty
sure we can come up with a better model for omindex that doesn't
confuse the hell out of people when they first meet it.

> We could just outlaw such partial updates, but that's probably
> unreasonable.  Perhaps disabling deletion in that case would do for now.
> At least it's a more unusual situation and it doesn't need a special
> switch.

Providing we document it clearly, that would probably be fine.

> The other approach I can see is to move to having a configuration file
> which describes what the index should contain.  Then omindex would be
> able to process all subsites in one pass, and so the "updated" map would
> be correct.  It also has the benefit that removing a whole subsite
> works.  However updates of single subsites, or sections of subsites still
> look like they'd be awkward, so this doesn't seem to address the hard
> part of the problem.

Thinking quickly: how about an omindex that uses a config file which
lists the subsites, where they are on disk and so forth, *but* which
can also work in "simple" mode, without subsites and without a
configuration file at all?
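
Purely as a strawman of what that could look like: the file format
below (one subsite per line: host, URL path, directory on disk, with
'#' comments) and every name in the code are invented for this sketch.
Nothing like it exists in omindex today.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical description of one subsite.
struct Subsite {
    std::string host;       // e.g. the host the H term is built from
    std::string url_path;   // e.g. the path the P term is built from
    std::string disk_path;  // where the files live on disk
};

// Parse the strawman config text. Blank lines, '#' comment lines and
// lines with fewer than three fields are skipped.
std::vector<Subsite> parse_subsites(const std::string& text) {
    std::vector<Subsite> out;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;
        std::istringstream fields(line);
        Subsite s;
        if (fields >> s.host >> s.url_path >> s.disk_path) out.push_back(s);
    }
    return out;
}

// No config (or an empty one) means "simple" mode: no subsites at all.
bool simple_mode(const std::vector<Subsite>& subsites) {
    return subsites.empty();
}
```

With the whole list available in one run, the "updated" map would
cover every subsite, which is the benefit described in the quoted
paragraph above.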

I think that means that the awkwardness sits entirely within the code,
and deals with our new user problem (which is that omindex isn't the
easiest thing to drive; and indeed I tend to have to read the code to
remind myself of fiddly details).

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org