[Xapian-devel] Backend for Lucene format indexes-How to get doclength

Richard Boulton richard at tartarus.org
Mon Jun 17 16:12:02 BST 2013


Ah, a quick follow-on from that: read
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/Similarity.html

There's a per-document "norm" which can be stored, which BM25Similarity
uses to store the document length.  Additional factors can be stored in
DocValuesFields (which are very similar to document values in Xapian, in
that they're stored in separate sequences, though are a bit more flexible).
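
To make it concrete why the per-document length matters here, below is a minimal sketch of where the length enters a textbook BM25 term weight. This is not Lucene's or Xapian's actual code: the function name, the parameter names and the default values of k1 and b are assumptions chosen for illustration. Setting b to 0 removes the length normalisation entirely, which is roughly what "BM25 without doclength" would amount to.

```python
import math


def bm25_term_weight(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """Textbook BM25 weight of one term in one document (illustrative only).

    tf          -- occurrences of the term in the document
    doc_len     -- length of this document (the per-document value discussed above)
    avg_doc_len -- average document length in the index (a per-database statistic)
    df          -- number of documents containing the term
    num_docs    -- total number of documents
    """
    # Inverse document frequency; the "1 +" keeps it positive for common terms.
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalisation: equals 1.0 for an average-length document,
    # and is always 1.0 when b == 0 (document length is then ignored).
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)


if __name__ == "__main__":
    stats = dict(tf=3, avg_doc_len=100, df=10, num_docs=1000)
    print(bm25_term_weight(doc_len=50, **stats))          # short document
    print(bm25_term_weight(doc_len=400, **stats))         # long document, same tf
    print(bm25_term_weight(doc_len=400, b=0.0, **stats))  # length ignored
```

With the same term frequency, the shorter document gets the higher weight unless b is 0, so dropping doclength changes the ranking whenever document lengths vary a lot.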


On 17 June 2013 16:06, Richard Boulton <richard at tartarus.org> wrote:

> You might want to look at how Lucene has implemented document length
> lookup for the BM25Similarity class (added in Lucene 4.0):
>
>
> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html
>
> I assumed they're using a document payload for storing the lengths, but
> haven't looked into it.
>
>
> On 17 June 2013 14:28, jiangwen jiang <jiangwen127 at gmail.com> wrote:
>
>> *Or do you mean that it's one number per document whereas the other stats
>> are per database, so it's harder to store it?*
>>
>> Yes, that is what I mean. It is a huge amount of data. If I add a new
>> doclength list myself (one holding every document's length, as chert does),
>> I am concerned that:
>> 1. This doclength list may become the bottleneck in this backend; see
>> http://trac.xapian.org/ticket/326
>> 2. It changes the Lucene file format too much, which makes it hard to compare
>> performance between Xapian and Lucene.
>>
>> Some ideas:
>> 1. Use a ranking algorithm that does not need doclength, such as BM25Weight or
>> TradWeight without doclength, or tfidfWeight.
>>     Will the ranking results be poor without doclength?
>> 2. Store doclength in the .prx payload when doing Lucene indexing.
>>
>> https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/Payload.html
>>     http://searchhub.org/2009/08/05/getting-started-with-payloads/
>>     But this method has an obvious drawback: it does not work for general
>> Lucene index data. If doclength was not stored at indexing time, this method
>>     does not work.
>>
>>>
>>> Any suggestions?
>>
>> Regards