[Xapian-tickets] [Xapian] #744: Merge tfidf-maxwdf-norm branch

Xapian nobody at xapian.org
Tue Jun 9 19:44:28 BST 2020


#744: Merge tfidf-maxwdf-norm branch
-------------------------+-------------------------------
 Reporter:  Olly Betts   |             Owner:  Olly Betts
     Type:  defect       |            Status:  new
 Priority:  normal       |         Milestone:  1.5.0
Component:  Library API  |           Version:
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+-------------------------------
Description changed by Olly Betts:

Old description:

> Nishad Dawkhar implemented the "maxwdf" norm for `TfIdfWeight`, which is
> on the tfidf-maxwdf-norm branch in git now.
>
> Because this changes the API of the Weight class (by adding a new
> parameter to get_sumpart()) this can't be merged in 1.4.x. I think it's
> better to hold off merging to master while these issues from before
> remain:
>
>  * Remote backend support
>  * Given we pass doclen and uniqterms to get_sumextra(), it would make
>    sense to pass wdfdocmax to that too.
>
> I'm not 100% happy with the way we seem to need to add new parameters to
> get_sumpart() from time to time, because this means every Weight subclass
> needs updating (fixing those in the library is OK, but this also affects
> user-defined weighting schemes). I wonder if there's a clean and
> efficient way to avoid this (it needs to be efficient as this method can
> get called a lot). Or perhaps there are only so many per-doc stats, and
> this is only the second time we've needed to do this.
>
> It'd also be nice to store the wdfdocmax stats (and the uniqueterms
> stats) for all the documents in a chunked stream (like how document
> lengths are stored) - the code to work them out in this patch is correct,
> but requires scanning the termlist of each document we need this stat
> for, which is quite a lot of work.

New description:

 Nishad Dawkhar implemented the "maxwdf" norm for `TfIdfWeight`, which is
 on the [source:/@tfidf-maxwdf-norm tfidf-maxwdf-norm branch in git] now.

 Because this changes the API of the Weight class (by adding a new
 parameter to get_sumpart()) this can't be merged in 1.4.x. I think it's
 better to hold off merging to master while these issues from before
 remain:

  * Remote backend support
  * Given we pass doclen and uniqterms to get_sumextra(), it would make
    sense to pass wdfdocmax to that too.

 I'm not 100% happy with the way we seem to need to add new parameters to
 get_sumpart() from time to time, because this means every Weight subclass
 needs updating (fixing those in the library is OK, but this also affects
 user-defined weighting schemes). I wonder if there's a clean and efficient
 way to avoid this (it needs to be efficient as this method can get called
 a lot). Or perhaps there are only so many per-doc stats, and this is only
 the second time we've needed to do this.
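
One possible way to keep the method signature stable is to bundle the per-document statistics into a single struct passed by const reference, so a new statistic becomes a new field rather than a new parameter. The sketch below is only an illustration under stated assumptions: `DocStats`, `SketchWeight`, `WdfOnly` and `MaxWdfNorm` are hypothetical names, not Xapian's API, and the "maxwdf" norm is assumed here to be wdf divided by the largest wdf in the document.

```cpp
// Hypothetical bundle of per-document statistics (not a Xapian type).
// Adding a new statistic means adding a field here, not a new parameter.
struct DocStats {
    unsigned wdf = 0;        // within-document frequency of the term
    unsigned doclen = 0;     // document length
    unsigned uniqterms = 0;  // number of unique terms in the document
    unsigned wdfdocmax = 0;  // largest wdf of any term in the document
};

// Hypothetical base class: this signature does not change when DocStats grows.
class SketchWeight {
  public:
    virtual ~SketchWeight() = default;
    virtual double get_sumpart(const DocStats& stats) const = 0;
};

// A subclass written before wdfdocmax existed; it needs no source changes.
class WdfOnly : public SketchWeight {
  public:
    double get_sumpart(const DocStats& stats) const override {
        return static_cast<double>(stats.wdf);
    }
};

// A subclass using the assumed maxwdf normalisation: wdf / wdfdocmax.
class MaxWdfNorm : public SketchWeight {
  public:
    double get_sumpart(const DocStats& stats) const override {
        if (stats.wdfdocmax == 0) return 0.0;  // avoid division by zero
        return static_cast<double>(stats.wdf) / stats.wdfdocmax;
    }
};
```

Whether the extra indirection through the struct is cheap enough for a method called this often would need measuring; the sketch only shows that existing subclasses would not need updating.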

 It'd also be nice to store the wdfdocmax stats (and the uniqueterms stats)
 for all the documents in a chunked stream (like how document lengths are
 stored) - the code to work them out in this patch is correct, but requires
 scanning the termlist of each document we need this stat for, which is
 quite a lot of work.
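
To make the cost difference concrete, here is a minimal sketch of the two approaches. It assumes a termlist can be modelled as a list of (term, wdf) pairs; `TermList`, `ScannedStats`, `scan_termlist` and `build_stats_table` are hypothetical names, and the flat table merely stands in for a stored stream without modelling Xapian's chunked on-disk format.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a document's termlist: (term, wdf) pairs.
using TermList = std::vector<std::pair<std::string, unsigned>>;

struct ScannedStats {
    unsigned wdfdocmax = 0;  // largest wdf seen in the termlist
    unsigned uniqterms = 0;  // number of termlist entries
};

// Scan approach: work is proportional to the number of terms in the document,
// and is repeated for every document whose stats are needed.
ScannedStats scan_termlist(const TermList& terms) {
    ScannedStats s;
    for (const auto& entry : terms) {
        s.wdfdocmax = std::max(s.wdfdocmax, entry.second);
        ++s.uniqterms;
    }
    return s;
}

// Stored approach: compute once (for example at indexing time) and keep one
// entry per document, so a later lookup is a single indexed read.
std::vector<ScannedStats> build_stats_table(const std::vector<TermList>& docs) {
    std::vector<ScannedStats> table;
    table.reserve(docs.size());
    for (const auto& doc : docs) {
        table.push_back(scan_termlist(doc));
    }
    return table;
}
```

With the table in place, the per-match cost no longer depends on how many terms a document contains.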

-- 
Ticket URL: <https://trac.xapian.org/ticket/744#comment:1>
Xapian <https://xapian.org/>