[Xapian-devel] Backend for Lucene format indexes-How to get doclength

jiangwen jiang jiangwen127 at gmail.com
Sun Sep 15 05:06:39 BST 2013


The code is updated now; please see the latest version.
I have also added copy-lucenedatabase.cc, which calculates wdf_upper_bound
and stores it in a new file, stat.xapian.
TfIdfWeight is the weighting scheme used.
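
For readers following along, here is a minimal sketch of how a per-term wdf upper bound could be collected in one pass over the postings during a copy. It is not the code in copy-lucenedatabase.cc: the Posting struct, the PostingLists type and compute_wdf_upper_bounds() are hypothetical names, the index is held in memory rather than read from Lucene files, and the layout of stat.xapian is not shown because it is not described in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory posting: one (docid, wdf) pair for a term.
struct Posting {
    std::uint32_t docid;
    std::uint32_t wdf;  // within-document frequency of the term in docid
};

// Hypothetical index shape: term -> postings. Not the Lucene on-disk layout.
using PostingLists = std::map<std::string, std::vector<Posting>>;

// For each term, return the largest wdf found in its posting list.
// That maximum is the tightest value a wdf upper bound can take.
std::map<std::string, std::uint32_t>
compute_wdf_upper_bounds(const PostingLists& index) {
    std::map<std::string, std::uint32_t> bounds;
    for (const auto& entry : index) {
        std::uint32_t max_wdf = 0;
        for (const Posting& p : entry.second) {
            assert(p.docid != 0);  // docid 0 is never valid in Xapian
            max_wdf = std::max(max_wdf, p.wdf);
        }
        bounds[entry.first] = max_wdf;
    }
    return bounds;
}
```

The resulting map could then be serialised in whatever form the stats file uses. Doing this at copy time means the cost is paid once rather than on every query.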

Regards


2013/9/3 jiangwen jiang <jiangwen127 at gmail.com>

> Collection frequency means how many times a particular term appears across
> all docs. This data does not exist in the Lucene backend (I will check on
> the Lucene mailing list later).
> Termfreq (how many docs contain a particular term) is the most similar
> data to collection freq, but I don't think the two are
> interchangeable.
> Now I am trying to calculate this data in copydatabase.
>
> Thanks
> Regards
>
>
>
> 2013/9/2 Olly Betts <olly at survex.com>
>
>> On Mon, Sep 02, 2013 at 09:21:48AM +0800, jiangwen jiang wrote:
>> > TfIdfWeight and BM25(b=0) also need wdf_upper_bound, which does not
>> > exist in the Lucene backend.
>>
>> If you don't provide an implementation of wdf_upper_bound(), the default
>> is to use the collection frequency of the term, so provided that
>> information is available in the lucene files, the lack of
>> wdf_upper_bound information isn't a show stopper.
>>
>> > I think this data will be calculated when doing copydatabase; I will
>> > update the code later.
>>
>> That's probably a good plan though.
>>
>> Cheers,
>>     Olly
>>
>
>
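
On the fallback discussed in the quoted thread: collection frequency works as a default wdf upper bound because it is the sum of a term's wdf values over all documents, so it can never be smaller than the largest single wdf. Termfreq only counts documents, so it gives no such guarantee. The sketch below illustrates this with plain numbers. TermStats and summarise() are hypothetical names for illustration and are not part of Xapian or Lucene.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-term statistics, derived from a list of wdf values
// with one entry per document that contains the term.
struct TermStats {
    std::uint64_t termfreq = 0;         // number of documents containing the term
    std::uint64_t collection_freq = 0;  // sum of wdf over all documents
    std::uint64_t max_wdf = 0;          // tightest possible wdf upper bound
};

TermStats summarise(const std::vector<std::uint32_t>& wdfs) {
    TermStats s;
    for (std::uint32_t wdf : wdfs) {
        ++s.termfreq;
        s.collection_freq += wdf;
        s.max_wdf = std::max<std::uint64_t>(s.max_wdf, wdf);
    }
    // A sum of non-negative values is never smaller than the largest one,
    // so collection frequency is always a valid (if loose) wdf upper bound.
    assert(s.max_wdf <= s.collection_freq);
    return s;
}
```

For a term with wdf values 5, 1 and 1 in three documents, collection frequency is 7 and termfreq is 3, while the largest wdf is 5. Using termfreq as the bound would underestimate here, which is why it cannot stand in for collection frequency.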