[Xapian-devel] Proposed changes to omindex

Olly Betts olly at survex.com
Tue Aug 29 00:06:53 BST 2006


On Sun, Aug 27, 2006 at 07:47:10PM +0100, James Aylett wrote:
> On Sun, Aug 27, 2006 at 07:00:44PM +0100, Olly Betts wrote:
> 
> > It might be better to put this directory on the wiki anyway - it's the
> > sort of thing we created the wiki for, and it would allow people to just
> > add their own entries.  Then your job would just be to make sure that
> > things stay tidy and sort out links which go dead.
> 
> I have a thing against projects that insist on using a wiki for
> permanent documentation. It just never feels very professional, and is
> often difficult to keep neat. Of course if I'm triaging it that's less
> of an issue.

We aren't talking about permanent documentation.  It's more like a
directory of contributed bookmarks.  I think it's exactly the sort of
thing which a wiki works well for.

But I'm not sure that it's the use of a wiki that is the real problem in
such cases.  It's generally that there's simply not enough editorial
control.

For example, Wikipedia manages to maintain a vast amount of what is
essentially documentation with relatively few problems because there are
enough people who care going round and keeping things tidy and
consistent.

> In my view the "right" solution *will* be to use the wiki as a highly
> mutable playpen for the documentation, links etc., and then use a CMS
> to manage the website. (Still not sure where this leaves the core
> documentation that needs to be exported as a book, although some CMSs
> have suitable functionality.)

I wonder if a CMS isn't overkill for what we need, but perhaps CMS
conjures up a different image to me than to you...

> > > We *could* start with 4 bytes and then auto-upgrade. Not sure if the
> > > space saving over 8 bytes is actually worth the hassle of maintaining
> > > BC code after 2038 though.
> > 
> > The auto-upgrade would be rather painful for a large database (though to
> > be honest I'd be astonished if we don't have an incompatible database
> > format change in the next 32 years anyway), which is why I suggested we
> > might want to put the extra byte in ahead of time.
> 
> By auto-upgrade, I *don't* mean upgrading the database, I mean
> transparency to the end user, ie your readers (omindex, primarily) can
> cope with either 4-byte or 8-byte, and upgrade as they update
> documents.

But 4-byte and 8-byte strings won't compare correctly with each other,
so you can't suddenly start adding 8-byte strings to a database full of
4-byte ones.  You'd need to convert all the 4-byte values first, or else
implement custom sort orders in the matcher.
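
To illustrate, here's a rough sketch in Python.  It assumes the
timestamps are stored as fixed-width big-endian binary strings and are
compared bytewise (neither is pinned down in this thread), and pack4(),
pack8() and widen() are names made up for the example:

```python
import struct

def pack4(t):
    """Encode a timestamp as a 4-byte big-endian string (assumed encoding)."""
    return struct.pack(">I", t)

def pack8(t):
    """Encode a timestamp as an 8-byte big-endian string (assumed encoding)."""
    return struct.pack(">Q", t)

def widen(value):
    """Convert a 4-byte value to the 8-byte form; leave 8-byte values alone."""
    if len(value) == 4:
        return pack8(struct.unpack(">I", value)[0])
    return value

old = pack4(1000)   # written before the switch to 8 bytes
new = pack8(2000)   # written after the switch, and later in time

# Bytewise comparison puts the older 4-byte value *after* the newer 8-byte
# one, because the 8-byte value starts with a longer run of zero bytes.
mixed_order_is_wrong = old > new

# Once every value has been widened, bytewise order matches numeric order.
widened_order_is_right = widen(old) < widen(new)
```

Both flags come out true here, which is why the existing values would
all need converting before any 8-byte ones are added.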

> > But omindex isn't this general - it indexes files forming a website.
> > There's nothing to stop people who are indexing from other sources
> > (whether with scriptindex or a custom indexer) adding a source
> > identifier if they find it useful, but let's consider whether it's
> > generally useful for omindex to do it rather than looking at other
> > situations.
> 
> Okay, but if omindex added the file path as the source identifier, I
> can see how that would be useful. In particular, if you (for some
> reason) batch delete files, it's an awful lot quicker than using
> omindex to reindex the entire system to get rid of them from xapian.

But you really don't want a field in the document data for that, since
you'd have to read the document data for every document in the database,
which would be really slow.  It could well be faster to just rerun
omindex once it checks last modified times (actually, we could add a
"purge" mode which just removes any documents whose files no longer
exist).

If you have a list of files you want to remove from the index, then
the best approach is probably to just run the list through the URL
mapping and then delete documents using the resulting URL terms.
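
Roughly like this, as a Python sketch.  The dict stands in for the
database (it isn't Xapian's API), the "U" prefix on URL terms and the
simple base-directory to base-URL mapping are assumptions, and
url_term() and delete_files() are hypothetical names.  With a real
database you'd pass each term to delete_document() instead of popping it
from a dict:

```python
def url_term(path, base_dir, base_url):
    """Map a file path under base_dir to its URL term (assumed "U" prefix)."""
    root = base_dir.rstrip("/") + "/"
    if not path.startswith(root):
        raise ValueError("path is outside the indexed tree: " + path)
    return "U" + base_url.rstrip("/") + "/" + path[len(root):]

def delete_files(index, paths, base_dir, base_url):
    """Delete the entry for each listed path; return how many were removed."""
    removed = 0
    for path in paths:
        # pop() with a default quietly ignores files which were never indexed.
        if index.pop(url_term(path, base_dir, base_url), None) is not None:
            removed += 1
    return removed
```

The point is that each deletion is a lookup by term, so nothing has to
read the document data.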

If you've removed whole directories, you don't even need the full list
of files - you can use Database::allterms_begin() and skip_to() the URL
term generated for the directory removed, then read all the URL terms
for files under that directory.
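
As a sketch of that scan in Python: a sorted list stands in for the
database's term list, and bisect_left() plays the part of skip_to().
terms_under() is a made-up name, and the "U" prefix in the usage below
is an assumption about how the URL terms look:

```python
import bisect

def terms_under(sorted_terms, dir_term):
    """Return all terms starting with dir_term from a sorted term list."""
    # Jump straight to the first term >= dir_term, as skip_to() would.
    i = bisect.bisect_left(sorted_terms, dir_term)
    found = []
    # Terms sharing the prefix are contiguous in sorted order, so stop at
    # the first one which doesn't match.
    while i < len(sorted_terms) and sorted_terms[i].startswith(dir_term):
        found.append(sorted_terms[i])
        i += 1
    return found
```

For example, terms_under(terms, "U/docs/old/") would pick up
"U/docs/old/x.html" but not "U/docs/older.html", provided the directory
term keeps its trailing slash.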

Cheers,
    Olly


