[Xapian-discuss] The position list table and index size (compared to Lucene)

Wed Nov 11 04:41:02 GMT 2009

On Mon, Nov 09, 2009 at 01:41:19PM +0100, Paul Boddie wrote:
> The "pathological" aspect of the indexes I've created is that few tokens
> are discarded, which obviously bloats each index and is likely to
> produce lots of position data.

Ah, it sounds like (and the table sizes back this up) you have more
positional data than is typical, and so suffer more from the current
storage overhead there.

> I know that this is a "bad thing" where
> text indexing solutions are concerned, but the domain I'm working in
> doesn't easily permit me to discard tokens which look like junk.
> (Actually, there are approaches which involve identifying regions of
> "interesting" text using machine learning approaches, just indexing
> those regions, and then throwing everything else away, but we have
> chosen not to follow that approach.)

It's good to avoid indexing junk, but then it's bad to not have indexed
something which someone wants to search for, and sometimes one person's
junk is another person's useful search term.

So your approach isn't unreasonable, at least in some applications.

>> As it also mentions, the chert backend in 1.1.x is more compact than the
>> flint backend which is the default in 1.0.x.
>
> I did try the 1.1.x branch a few weeks ago, but the savings appeared to
> be negligible for my particular case.

There aren't any real improvements to the positional data storage in chert
over flint.  The notable changes are to values and postlists.

> I imagine that Xapian's index structure really is quite different from
> Lucene's, then, since Lucene seems to use a fairly flat index structure
> and manages to employ term name compression by encoding the length of
> the common prefix from one term to the next.

Yes, they really are rather different.

Lucene uses a much more static storage system than Xapian, and relies on
merging to perform incremental updates.  Xapian's storage is more dynamic
which allows incremental updates to be made in place.  There are pros and
cons to both approaches.

We do compress common prefixes from terms in some places, but we could do
more.

>> If you measure how much of the position table is actually the positional
>> data the percentage is surprisingly low.  I got 14% for gmane, though
>> that will certainly vary, but something like 86% is a mixture of keys,
>> Btree structure, and unused space.  We can't eliminate all of that (we
>> do need to somehow store what each stream of positional data is for),
>> but it shouldn't be hard to make a dramatic difference here.  This is
>> something I'm likely to be working on later this month.
>
> But how much positional data does GMane index now? My problem may well
> be related to the excess of positional data which most people would try
> and eliminate simply by reducing the number of indexed tokens. I'm
> prepared to accept this explanation, but it still leaves that awkward
> issue of how Lucene manages to handle similar volumes of positional data.

I imagine by avoiding much of that 86%.

Do you have a Lucene database for comparable data?  If so it would be
interesting to see the size breakdown for the files (you might have to
look inside the .cfs file - I don't know if there's a tool to do that
though...)

Cheers,
    Olly