[Xapian-devel] Omega changes

Olly Betts olly at survex.com
Fri Dec 17 18:18:10 GMT 2004


On Fri, Dec 17, 2004 at 05:04:21PM +0000, James Aylett wrote:
> How often do people remove documents?

Anywhere from never to all the time!

> If we make it delete only if
> told to, then you skip that step and save some time. And memory,
> actually - using a vector<bool> is fairly compact, but will start
> becoming significant in a very large corpus over time. If I'm never
> deleting documents I can save quite a bit of memory.

I bet the time saved is negligible.  Memory is slightly more of a concern,
although I thought this through carefully when I devised the scheme...

Since vector<bool> should be specialised (it is with GCC), it'll take
only one bit per document id which has ever been used.  So for 1
million documents, you need just 122KB (and this scales linearly).

Any large gaps in the document id space will just page out to swap.

Also note that documents will often be rescanned in close to docid
order, because that's how they were added.  So the working set will
typically be small.
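
As a rough illustration of the one-bit-per-docid figure, here is a
minimal sketch of such a bitmap.  The class name SeenDocs and its
methods are hypothetical, not taken from Omega, and the size estimate
assumes a bit-packed vector<bool> such as GCC's.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not Omega code): records which document ids
// have been seen, using one bit per docid slot.
class SeenDocs {
    std::vector<bool> seen_;  // assumed bit-packed, as with GCC

  public:
    // Mark a docid as seen, growing the bitmap on demand.
    void mark(std::size_t docid) {
        if (docid >= seen_.size()) seen_.resize(docid + 1, false);
        seen_[docid] = true;
    }

    // True only if this docid was marked earlier.
    bool is_marked(std::size_t docid) const {
        return docid < seen_.size() && seen_[docid];
    }

    // Approximate payload size in bytes at one bit per slot
    // (ignores the vector's own bookkeeping and allocator slack).
    std::size_t approx_bytes() const { return (seen_.size() + 7) / 8; }
};
```

With a million slots the estimate comes to 125000 bytes, which is the
122KB quoted above.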

A good guiding rule is that Omega should do the sane thing by default.
If nothing else, it makes the quickstart guide easier to write.
Tracking deleted documents is often required, and the overhead is
pretty small for a modest database, so it should stay the default.

A "don't delete" option is worth considering, though I do wonder if
the benefit might be lost in the noise.

> I'm expecting lots of the database to be successfully cached. Were you
> thinking of putting this in a value, or in the document data? The
> former might cache better in this context, but that isn't a terribly
> good argument for putting it there.

If we put it in a value, we get "sort by date" as if by magic.  So
that seems a good plan.  It also means if the search front end doesn't
use the timestamp, there's zero search-time overhead from this change.
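
For "sort by date" to fall out of a value, the stored form has to
order the same way as the timestamps themselves.  A minimal sketch,
assuming values are compared as byte strings (an assumption here, as
is the helper name encode_timestamp), is to store the timestamp as
fixed-width big-endian bytes:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not Omega code): encode a timestamp as 4
// big-endian bytes, so that byte-wise string comparison orders the
// encoded values the same way as the original numbers.
std::string encode_timestamp(std::uint32_t t) {
    std::string s(4, '\0');
    s[0] = static_cast<char>((t >> 24) & 0xff);  // most significant byte first
    s[1] = static_cast<char>((t >> 16) & 0xff);
    s[2] = static_cast<char>((t >> 8) & 0xff);
    s[3] = static_cast<char>(t & 0xff);
    return s;
}
```

A decimal string would not do without zero padding, since "9" sorts
after "10" when compared as text.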

> Or you could just auto-detect the
> language from their browser, and give them the option to change it. If
> it's prominent, that's not a bad solution at all.

This seems better to me.  Treat this as a UI design issue (which is
probably what it really is).

If we can devise a clean and well thought out approach to searching
multiple languages at once, great.  But let's do it for the right
reasons!

Cheers,
    Olly



