[Xapian-devel] Backend for Lucene format indexes-How to get doclength

Richard Boulton richard at tartarus.org
Mon Jun 17 16:06:36 BST 2013

You might want to look at how Lucene has implemented document length lookup
for the BM25Similarity class (added in Lucene 4.0):


I assumed they're using a document payload for storing the lengths, but
haven't looked into it.

On 17 June 2013 14:28, jiangwen jiang <jiangwen127 at gmail.com> wrote:

> *Or do you mean that it's one number per document whereas the other stats
> are per database, so it's harder to store it?*
> Yes, that is what I mean. It's a huge amount of data. If I add a new
> doclength list myself (one that holds every document's length in a single
> list, as chert does), I am concerned about two things:
> 1. This doclength list may become the bottleneck in this backend; see
> http://trac.xapian.org/ticket/326
> 2. It changes too much of the Lucene file format, which makes it hard to
> compare performance between Xapian and Lucene.
> Some ideas:
> 1. Use a ranking algorithm that does not need doclength, such as
> BM25Weight or TradWeight without doclength, or tfidfWeight.
>     Will the ranking results be poor without doclength?
> 2. Store doclength in the .prx payload when doing Lucene indexing.
> https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/Payload.html
>     http://searchhub.org/2009/08/05/getting-started-with-payloads/
>     But this method has an obvious drawback: it does not work for general
> Lucene index data, because if doclength was not stored at indexing time,
> the method doesn't work.
> Any suggestions?
> Regards
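
On idea 1 above (ranking without doclength), here is a minimal sketch of why
a BM25-style weight can run without per-document lengths. It uses a common
textbook form of BM25, not Xapian's or Lucene's exact formula, and the
function name bm25_term_score and all of its parameter names are made up for
illustration. The point is only that with b = 0 the length normalisation
factor is always 1, so doc_len drops out of the score. In Xapian I believe
the equivalent is constructing BM25Weight with its b parameter set to 0, but
check that against the API documentation.

```python
import math


def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score one query term for one document with a textbook BM25 form.

    tf: occurrences of the term in the document
    df: number of documents containing the term
    n_docs: number of documents in the collection
    doc_len, avg_doc_len: this document's length and the collection average
    k1, b: the usual BM25 tuning parameters (hypothetical defaults)
    """
    # Smoothed inverse document frequency, never negative.
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    # Length normalisation: equals 1.0 for every document when b == 0.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


if __name__ == "__main__":
    # Same term statistics, three different document lengths.
    for doc_len in (50, 500, 5000):
        with_len = bm25_term_score(3, 10, 1000, doc_len, 500)
        without_len = bm25_term_score(3, 10, 1000, doc_len, 500, b=0.0)
        print(doc_len, round(with_len, 4), round(without_len, 4))
```

With b = 0 the three documents score identically, so the backend never has
to look up a doclength. The cost is that long documents are no longer
penalised for matching a term simply because they contain more text, which
is the ranking quality question raised above.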