[Xapian-discuss] The position list table and index size (compared to Lucene)
Paul Boddie
paul.boddie at biotek.uio.no
Wed Nov 11 13:20:45 GMT 2009
Olly Betts wrote:
> On Mon, Nov 09, 2009 at 01:41:19PM +0100, Paul Boddie wrote:
>
>> The "pathological" aspect of the indexes I've created is that few tokens
>> are discarded, which obviously bloats each index and is likely to
>> produce lots of position data.
>>
>
> Ah, it sounds like (and the table sizes back this up) you have more
> positional data than is typical, and so suffer more from the current
> storage overhead there.
>
Yes, we really take a hit on the positional data. ;-)
> It's good to avoid indexing junk, but then it's bad to not have indexed
> something which someone wants to search for, and sometimes one person's
> junk is another person's useful search term.
>
> So your approach isn't unreasonable, at least in some applications.
>
The arguably naive approach being taken involves tokenising lots of
stuff to a very precise level in order to expose tokens which would
frequently be junk in many applications (symbols, punctuation, single
letters, numbers) and then performing searches where such tokens are
involved. That said, a review of such an approach is possibly in order:
we could compromise on our index's accuracy by permitting false
positives (for example, omitting tokens which might "obstruct" matches),
but then we could eliminate such results when actually inspecting the
stored field holding the original text.
I hope that doesn't sound too bizarre. Really, I'm seeking inspiration
from something like a hashtable when I talk about compromising index
accuracy: although it's desirable to distribute values in such data
structures in order to provide optimal retrieval performance, the
general case in retrieval has to consider the possibility of hash
collisions. My "index thinning" strategy would take a similar
pessimistic approach: the "hash" is a suspicion that a result is found
in a particular document based on asking the index.
> There aren't any real improvements to the positional data storage in chert
> over flint. The notable changes are to values and postlists.
>
Right. Thanks for summarising this!
> Lucene uses a much more static storage system than Xapian, and relies on
> merging to perform incremental updates. Xapian's storage is more dynamic
> which allows incremental updates to be made in place. There are pros and
> cons to both approaches.
>
> We do compress common prefixes from terms in some places, but we could do
> more.
>
In a sense, Xapian is probably a bit more suited to highly dynamic data,
although I notice that Lucene now supposedly supports "real-time"
features, whatever they are supposed to be. Sadly, our application
doesn't really need such dynamic features.
[...]
> Do you have a Lucene database for comparable data? If so it would be
> interesting to see the size breakdown for the files (you might have to
> look inside the .cfs file - I don't know if there's a tool to do that
> though...)
>
The Lucene index structure isn't as "filesystem-friendly" as Xapian's.
:-) I could try and use one of the tools, or maybe prod at an index via
the API, to see what the distribution is like. In my own Lucene-inspired
index structure (which is filesystem-friendly), I get a distribution
like this:
Term dictionary (terms, frequencies, pointers to positions): ~1.6M
Positions (just documents and token positions): ~36M
Stored fields: ~36M
Here, the term and position lists have the benefit of total compaction -
there's no scope for insertion or any dynamic updates - and these things
would correspond to the lowest level of a B-tree-like data structure.
There are also tables which are loaded into memory and used to navigate
these files, but the largest (for the positions, of course) is less than
2M, depending on how the table is configured, since it uses a naive
"every nth entry" scheme for navigation into the actual positional data.
I'm not saying that the distribution in a real Lucene index would be
identical to this, but it wouldn't surprise me if it were close to this.
Paul
More information about the Xapian-discuss
mailing list