[Xapian-devel] Backend for Lucene format indexes-How to get doclength
Olly Betts
olly at survex.com
Mon Jun 17 12:39:08 BST 2013
On Sun, Jun 16, 2013 at 12:32:31PM +0800, jiangwen jiang wrote:
> I have written a demo patch for a backend for Lucene format indexes. The
> Lucene version is 3.6.2.
> http://lucene.apache.org/core/3_6_2/fileformats.html
Sounds cool.
> So far, I have found that some data needed for BM25 in Xapian does not
> exist in Lucene:
> 1. doclength_lower_bound, doclength_upper_bound
> 2. wdf_lower_bound, wdf_upper_bound
> 3. total_length
> 4. doclength (for each document)
> 1-3 are statistics, which can be calculated when doing copydatabase and
> stored somewhere. But doclength is
> hard to handle this way.
Xapian's doclength is defined as sum(wdf), so I think you should be able
to calculate it with a tool which scans the database in a
copydatabase-like manner.
Or do you mean that it's one number per document whereas the other stats
are per database, so it's harder to store it?
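To make the sum(wdf) definition concrete, here is a minimal sketch of such a scan. It does not use the Xapian or Lucene APIs: the PostingLists type, the LengthStats struct and compute_length_stats() are hypothetical names made up for illustration, and an in-memory map stands in for the index.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an index: term -> list of (docid, wdf) postings.
using Posting = std::pair<std::uint32_t, std::uint32_t>;
using PostingLists = std::map<std::string, std::vector<Posting>>;

// Hypothetical container for the length statistics discussed above.
struct LengthStats {
    std::map<std::uint32_t, std::uint64_t> doclength;  // docid -> sum(wdf)
    std::uint64_t total_length = 0;
    std::uint64_t doclength_lower_bound = 0;
    std::uint64_t doclength_upper_bound = 0;
};

// One pass over every posting list, accumulating wdf per document.
LengthStats compute_length_stats(const PostingLists& index) {
    LengthStats s;
    for (const auto& entry : index) {
        for (const Posting& p : entry.second) {
            s.doclength[p.first] += p.second;  // doclength = sum of wdf
            s.total_length += p.second;
        }
    }
    // Derive the bounds from the per-document lengths.
    bool first = true;
    for (const auto& d : s.doclength) {
        if (first || d.second < s.doclength_lower_bound)
            s.doclength_lower_bound = d.second;
        if (first || d.second > s.doclength_upper_bound)
            s.doclength_upper_bound = d.second;
        first = false;
    }
    // Sanity check: the bounds must be ordered once any document was seen.
    assert(s.doclength.empty() ||
           s.doclength_lower_bound <= s.doclength_upper_bound);
    return s;
}
```

Note that a document with no terms never shows up in any posting list, so a real tool would have to account for such documents (length 0) separately.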
> 1. some other data instead of doclength?
I don't know what else you could use instead.
> 2. Does Xapian support other ranking algorithms which do not need doclength?
Yes. With certain parameter settings, BM25Weight and TradWeight don't
need doclength. If you look in include/xapian/weight.h, you can see
when need_stat(DOC_LENGTH) is called:
BM25Weight:

    if (param_k1 != 0 && param_b != 0) need_stat(DOC_LENGTH);

(so if you set k1=0 or b=0, BM25Weight won't use doclength).

TradWeight:

    if (param_k != 0.0) {
        need_stat(AVERAGE_LENGTH);
        need_stat(DOC_LENGTH);
    }

(so if k=0, TradWeight won't use doclength).
Also, TfIdfWeight (which is on trunk, and 1.3.1) never uses doclength.
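To see why k1=0 or b=0 removes the dependence on doclength, here is a small sketch of the textbook BM25 term-frequency component. This is an assumption for illustration, not Xapian's actual implementation: bm25_tf_component() and its parameter names are hypothetical, and the real BM25Weight code has further terms and parameters.

```cpp
#include <cassert>

// Textbook BM25 within-document component (illustrative only):
//   wdf * (k1 + 1) / (wdf + k1 * ((1 - b) + b * doclength / average_length))
// Assumes wdf > 0 and average_length > 0.
double bm25_tf_component(double wdf, double doclength, double average_length,
                         double k1, double b) {
    assert(average_length > 0.0);
    // Normalised length is the only place doclength enters the formula.
    double normlen = doclength / average_length;
    return wdf * (k1 + 1.0) / (wdf + k1 * ((1.0 - b) + b * normlen));
}
```

With b=0 the normlen term is multiplied by zero, and with k1=0 the whole length-dependent factor is multiplied by zero, so in both cases the result is the same for any doclength.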
> And the demo patch is here:
> https://github.com/white127/xapian-patch/blob/master/xapian_lucene_demo.patch
Thanks - I'll take a look.
Cheers,
Olly