[Xapian-devel] Backend for Lucene format indexes-How to get doclength

Richard Boulton richard at tartarus.org
Mon Jun 17 16:06:36 BST 2013


You might want to look at how Lucene has implemented document length lookup
for the BM25Similarity class (added in Lucene 4.0):

http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html

I assume they're using a per-document payload to store the lengths, but I
haven't looked into it.
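
On the question quoted below of whether ranking suffers without doclength, it may help to see where the document length enters the formula. The following is a minimal sketch of a textbook BM25 variant, not the code of Xapian's BM25Weight or Lucene's BM25Similarity; the function name, parameter names, and sample numbers are all made up for illustration. Document length only appears in the normalisation term, so setting b to 0 gives a score that needs no doclength lookup at all, at the cost of no longer penalising long documents.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one term in one document with a common textbook BM25 variant.

    All names here are illustrative; this is not Xapian's or Lucene's API.
    """
    # IDF component: rarer terms get a larger weight.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: the only place doc_len is used.
    # With b = 0 this is exactly 1.0, whatever doc_len is.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)


# Same term frequency, different document lengths (hypothetical numbers).
stats = dict(tf=3, avg_doc_len=100, n_docs=1000, doc_freq=10)

short_doc = bm25_term_score(doc_len=50, **stats)
long_doc = bm25_term_score(doc_len=400, **stats)

# With b = 0 the two documents become indistinguishable by length.
short_doc_b0 = bm25_term_score(doc_len=50, b=0.0, **stats)
long_doc_b0 = bm25_term_score(doc_len=400, b=0.0, **stats)

print(short_doc > long_doc)          # length normalisation favours the short doc
print(short_doc_b0 == long_doc_b0)   # no length dependence when b = 0
```

Whether the b = 0 ranking is good enough depends on how much document lengths vary in the collection; with near-uniform lengths the difference should be small.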


On 17 June 2013 14:28, jiangwen jiang <jiangwen127 at gmail.com> wrote:

> *Or do you mean that it's one number per document whereas the other stats
> are per database, so it's harder to store it?*
>
> Yes, that is what I mean. It is a lot of data. If I add a new doclength
> list myself (one list holding every document's length, as chert does),
> I am concerned about two things:
> 1. This doclength list may become the bottleneck in this backend; see
> http://trac.xapian.org/ticket/326
> 2. It changes too much on top of the Lucene file format, which makes it
> hard to compare performance between Xapian and Lucene.
>
> Some ideas:
> 1. Use a ranking algorithm that doesn't need doclength, such as BM25Weight
> or TradWeight without doclength, or TfIdfWeight.
>     Will the ranking results be poor without doclength?
> 2. Store doclength in the .prx payload when doing the Lucene indexing.
>
> https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/Payload.html
>     http://searchhub.org/2009/08/05/getting-started-with-payloads/
>     But this method has an obvious drawback: it doesn't work for general
> Lucene index data. If doclength was not stored at indexing time, this
>     method doesn't work.
>
>>
>> Any suggestions?
>
> Regards
>