[Xapian-discuss] The position list table and index size (compared to Lucene)

Wed Nov 11 13:20:45 GMT 2009

Olly Betts wrote:
> On Mon, Nov 09, 2009 at 01:41:19PM +0100, Paul Boddie wrote:
>   
>> The "pathological" aspect of the indexes I've created is that few tokens
>> are discarded, which obviously bloats each index and is likely to
>> produce lots of position data.
>>     
>
> Ah, it sounds like (and the table sizes back this up) you have more
> positional data than is typical, and so suffer more from the current
> storage overhead there.
>   

Yes, we really take a hit on the positional data. ;-)

> It's good to avoid indexing junk, but then it's bad to not have indexed
> something which someone wants to search for, and sometimes one person's
> junk is another person's useful search term.
>
> So your approach isn't unreasonable, at least in some applications.
>   

The arguably naive approach being taken involves tokenising lots of 
stuff to a very precise level in order to expose tokens which would 
frequently be junk in many applications (symbols, punctuation, single 
letters, numbers) and then performing searches where such tokens are 
involved. That said, a review of such an approach is possibly in order: 
we could compromise on our index's accuracy by permitting false 
positives (for example, omitting tokens which might "obstruct" matches), 
but then we could eliminate such results when actually inspecting the 
stored field holding the original text.

I hope that doesn't sound too bizarre. Really, I'm seeking inspiration 
from something like a hashtable when I talk about compromising index 
accuracy: although it's desirable to distribute values in such data 
structures in order to provide optimal retrieval performance, the 
general case in retrieval has to consider the possibility of hash 
collisions. My "index thinning" strategy would take a similar 
pessimistic approach: the "hash" is a suspicion that a result is found 
in a particular document based on asking the index.

> There aren't any real improvements to the positional data storage in chert
> over flint.  The notable changes are to values and postlists.
>   

Right. Thanks for summarising this!

> Lucene uses a much more static storage system than Xapian, and relies on
> merging to perform incremental updates.  Xapian's storage is more dynamic
> which allows incremental updates to be made in place.  There are pros and
> cons to both approaches.
>
> We do compress common prefixes from terms in some places, but we could do
> more.
>   

In a sense, Xapian is probably a bit more suited to highly dynamic data, 
although I notice that Lucene now supposedly supports "real-time" 
features, whatever they are supposed to be. Sadly, our application 
doesn't really need such dynamic features.

[...]

> Do you have a Lucene database for comparable data?  If so it would be
> interesting to see the size breakdown for the files (you might have to
> look inside the .cfs file - I don't know if there's a tool to do that
> though...)
>   

The Lucene index structure isn't as "filesystem-friendly" as Xapian's. 
:-) I could try and use one of the tools, or maybe prod at an index via 
the API, to see what the distribution is like. In my own Lucene-inspired 
index structure (which is filesystem-friendly), I get a distribution 
like this:

Term dictionary (terms, frequencies, pointers to positions): ~1.6M
Positions (just documents and token positions): ~36M
Stored fields: ~36M

Here, the term and position lists have the benefit of total compaction - 
there's no scope for insertion or any dynamic updates - and these things 
would correspond to the lowest level of a B-tree-like data structure. 
There are also tables which are loaded into memory and used to navigate 
these files, but the largest (for the positions, of course) is less than 
2M, depending on how the table is configured, since it uses a naive 
"every nth entry" scheme for navigation into the actual positional data.

I'm not saying that the distribution in a real Lucene index would be 
identical to this, but it wouldn't surprise me if it were close to this.

Paul