[Xapian-devel] Backend for Lucene format indexes-How to get doclength

Olly Betts olly at survex.com
Mon Jun 17 12:39:08 BST 2013


On Sun, Jun 16, 2013 at 12:32:31PM +0800, jiangwen jiang wrote:
> I have written a demo patch for a backend for Lucene format indexes. The
> Lucene version is 3.6.2.
> http://lucene.apache.org/core/3_6_2/fileformats.html

Sounds cool.

> So far, I have found that some data Xapian needs for BM25 does not exist in
> Lucene:
> 1. doclength_lower_bound, doclength_upper_bound
> 2. wdf_lower_bound, wdf_upper_bound
> 3. total_length
> 4. doclength (for each document)
> 1-3 are statistics which can be calculated when doing copydatabase and
> stored somewhere. But doclength is hard to handle this way.

Xapian's doclength is defined as sum(wdf), so I think you should be able
to calculate it with a tool which scans the database in a
copydatabase-like manner.
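
As a rough illustration of that scan, here is a minimal, self-contained
sketch. It does not use the Xapian or Lucene APIs: the Posting record, the
LengthStats struct and compute_length_stats() are hypothetical names made up
for this example, and a real tool would read the postings from the index
rather than from a vector. It accumulates doclength as sum(wdf) per document
and derives the per-database statistics in the same pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical record: term `term_id` occurs `wdf` times in document `docid`.
struct Posting {
    std::uint32_t term_id;
    std::uint32_t docid;
    std::uint32_t wdf;
};

// Hypothetical container for the statistics discussed in this thread.
struct LengthStats {
    std::map<std::uint32_t, std::uint64_t> doclength;        // docid -> sum(wdf)
    std::map<std::uint32_t, std::uint32_t> wdf_upper_bound;  // term -> max wdf
    std::uint64_t total_length = 0;
    std::uint64_t doclength_lower_bound = 0;
    std::uint64_t doclength_upper_bound = 0;
};

// Scan every posting once, summing wdf per document.
// Documents with no postings never appear here, so they are not counted.
LengthStats compute_length_stats(const std::vector<Posting>& postings) {
    LengthStats s;
    for (const Posting& p : postings) {
        s.doclength[p.docid] += p.wdf;
        s.total_length += p.wdf;
        std::uint32_t& bound = s.wdf_upper_bound[p.term_id];
        bound = std::max(bound, p.wdf);
    }
    // The doclength bounds are only known once all lengths are complete.
    bool first = true;
    for (const auto& entry : s.doclength) {
        const std::uint64_t len = entry.second;
        if (first) {
            s.doclength_lower_bound = len;
            s.doclength_upper_bound = len;
            first = false;
        } else {
            s.doclength_lower_bound = std::min(s.doclength_lower_bound, len);
            s.doclength_upper_bound = std::max(s.doclength_upper_bound, len);
        }
    }
    return s;
}
```

The per-document map is the part that needs somewhere to live alongside the
Lucene files; the other values are a handful of numbers per database.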

Or do you mean that it's one number per document whereas the other stats
are per database, so it's harder to store it?

> 1. some other data instead of doclength?

I don't know what else you could use instead.

> 2. Does Xapian support other ranking algorithms which do not need doclength?

Yes.  With certain parameter settings, BM25Weight and TradWeight don't
need doclength.  If you look in include/xapian/weight.h, you can see
when need_stat(DOC_LENGTH) is called:

BM25Weight:

        if (param_k1 != 0 && param_b != 0) need_stat(DOC_LENGTH); 

(so if you set k1=0 or b=0, BM25Weight won't use doclength).

TradWeight:

        if (param_k != 0.0) {
            need_stat(AVERAGE_LENGTH);
            need_stat(DOC_LENGTH);
        }

(so if k=0, TradWeight won't use doclength).

Also, TfIdfWeight (which is on trunk, and in 1.3.1) never uses doclength.
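
To see why those parameter settings remove the dependence on doclength, the
sketch below writes out the wdf-dependent factor of each scheme in its
textbook form. This is an illustration only, not code copied from Xapian:
the function names are made up here, normlen is assumed to mean doclength
divided by average doclength, and the term-weight multiplier and the other
BM25 parameters are left out.

```cpp
#include <cassert>

// Textbook BM25 wdf factor. normlen = doclength / average doclength.
// With b == 0 the normlen term is multiplied by zero; with k1 == 0 the
// whole length-dependent part of the denominator vanishes.
double bm25_wdf_factor(double wdf, double normlen, double k1, double b) {
    assert(wdf > 0);
    return (k1 + 1) * wdf / (k1 * ((1 - b) + b * normlen) + wdf);
}

// Traditional probabilistic wdf factor. With k == 0 it is wdf / wdf.
double trad_wdf_factor(double wdf, double normlen, double k) {
    assert(wdf > 0);
    return wdf / (k * normlen + wdf);
}
```

For example, bm25_wdf_factor(3, 0.5, 1.2, 0) and bm25_wdf_factor(3, 4.0, 1.2, 0)
give the same value, whereas with b = 0.75 they differ.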

> And the demo patch is here:
> https://github.com/white127/xapian-patch/blob/master/xapian_lucene_demo.patch

Thanks - I'll take a look.

Cheers,
    Olly


