[Xapian-devel] the influence of the change of the doclen chunk format

Olly Betts olly at survex.com
Fri May 23 02:32:04 BST 2014


On Wed, May 21, 2014 at 10:14:15PM +0800, Hurricane Tong wrote:
> I am going to make a patch applying the new doclen chunk format.
> But I can't figure out the influence of the change.
> Olly once said the matching procedure uses the doclen postlist.
> How can I find all the components that use the doclen postlist?

I would suggest just profiling some searches.  The default weighting
uses the doclength, and that's the case which Richard looked at in #326.

There are some particular cases which use the doclen list more heavily,
but these don't really need to be able to skip through it efficiently.

> As the fixed-width format is designed for contiguous docids,
> I just want to apply this format to the doclen postlist, rather than
> to ordinary term postlists.
> So I can't just change the PostlistChunkReader.

Currently the doclength encoding just shares code with the postlist
encoding for convenience, but you'll need to provide a separate handler
for it now.
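
To illustrate why a fixed-width layout suits runs of contiguous docids,
here is a minimal sketch in Python.  It is NOT Xapian's chunk format nor
the format proposed in the patch: the header layout (first docid, count,
byte width) and the names encode_run() and lookup() are invented purely
for illustration.

```python
import struct

HEADER = "<IIB"  # hypothetical header: first docid, count, bytes per length

def encode_run(first_docid, doclens):
    """Pack one non-empty run of contiguous docids as header + fixed-width lengths."""
    # Smallest whole number of bytes that can hold the largest length.
    width = max(1, (max(doclens).bit_length() + 7) // 8)
    header = struct.pack(HEADER, first_docid, len(doclens), width)
    body = b"".join(n.to_bytes(width, "little") for n in doclens)
    return header + body

def lookup(chunk, docid):
    """Return the doclen for docid, or None if docid is outside this run."""
    first, count, width = struct.unpack_from(HEADER, chunk, 0)
    if not first <= docid < first + count:
        return None
    # Fixed width means the entry's offset is computed, not scanned for.
    offset = struct.calcsize(HEADER) + (docid - first) * width
    return int.from_bytes(chunk[offset:offset + width], "little")
```

Because every entry in the run has the same width, a lookup is a single
offset calculation, which is what makes skipping cheap.  The cost is that
one unusually long document widens every entry in its run.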

> And how can I get some real doclen data to test the performance of the
> new format?
> Or should I just generate some data randomly?

Randomly generated data tends to have different characteristics to real
data, so it's better to avoid it for this sort of thing, as you can end
up optimising for the wrong things.

I've put two sets of doclen data from real databases here, and a third
one will appear shortly (it's taking a while to copy over):

http://oligarchy.co.uk/xapian/data/doclens/

The format is one (docid, doclen) pair per line, in ascending docid order.
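
As a rough way to see how much of such a dump is made up of contiguous
docids, the sketch below parses the pairs and groups them into runs.  It
assumes each line holds a docid and a doclen separated by whitespace (as
the awk command later in this message would print); the function names
are hypothetical.

```python
def parse_doclens(lines):
    """Parse 'docid doclen' lines into a list of (docid, doclen) int pairs."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        pairs.append((int(fields[0]), int(fields[1])))
    return pairs

def contiguous_runs(pairs):
    """Return (first_docid, run_length) for each run of consecutive docids."""
    runs = []
    for docid, _doclen in pairs:
        if runs and docid == runs[-1][0] + runs[-1][1]:
            # docid directly follows the current run, so extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((docid, 1))
    return runs
```

For a dump on disk, something like `contiguous_runs(parse_doclens(open(path)))`
gives the run list, with `path` being wherever the file was saved.  Few
long runs would favour a fixed-width layout; many short runs would not.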

The archives collection is from indexed files, and has 101440 documents
present out of 102433 used docids.

The email collection has more missing docids, partly due to spam
deletion, but also because refiling an email is handled as a deletion
followed by an addition in this system.  It has 20934 documents present
out of 33384 used docids.  I would say this one is probably unusually
sparse.

The gmane collection (which is on its way) has no missing docids, and
has 114086700 docids.
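
To put the three collections side by side, here is a small calculation
of how densely each docid range is populated.  It assumes the email
figures mean 20934 documents present out of 33384 used docids, and that
"no missing docids" for gmane means present equals used; density() is a
name made up for this sketch.

```python
def density(present, used):
    """Fraction of the used docid range that holds a document."""
    return present / used

# (documents present, used docids) as quoted in this message.
collections = {
    "archives": (101440, 102433),
    "email": (20934, 33384),
    "gmane": (114086700, 114086700),
}

for name, (present, used) in collections.items():
    gaps = used - present
    print(f"{name}: {density(present, used):.1%} dense, {gaps} missing docids")
```

This works out to roughly 99.0% for archives, 62.7% for email and 100%
for gmane, so the email set is the only one with a substantial share of
gaps.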

I dumped these out using:

xapian-delve /path/to/db -t '' -1 -v|awk '($1 != "Posting"){print $1" "$3}'

I can get more data if you need more.

Cheers,
    Olly


