[Xapian-tickets] [Xapian] #744: Merge tfidf-maxwdf-norm branch
Xapian
nobody at xapian.org
Tue Jun 9 19:44:28 BST 2020
#744: Merge tfidf-maxwdf-norm branch
-------------------------+-------------------------------
Reporter: Olly Betts | Owner: Olly Betts
Type: defect | Status: new
Priority: normal | Milestone: 1.5.0
Component: Library API | Version:
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
-------------------------+-------------------------------
Description changed by Olly Betts:
Old description:
> Nishad Dawkhar implemented the "maxwdf" norm for `TfIdfWeight`, which is
> on the tfidf-maxwdf-norm branch in git now.
>
> Because this changes the API of the Weight class (by adding a new
> parameter to get_sumpart()) this can't be merged in 1.4.x. I think it's
> better to hold off merging to master while these issues from before
> remain:
>
> * Remote backend support
> * Given we pass doclen and uniqterms to get_sumextra(), it would make
> sense to pass wdfdocmax to that too.
>
> I'm not 100% happy with the way we seem to need to add new parameters to
> get_sumpart() from time to time, because this means every Weight subclass
> needs updating (fixing those in the library is OK, but this also affects
> user-defined weighting schemes). I wonder if there's a clean and
> efficient way to avoid this (it needs to be efficient as this method can
> get called a lot). Or perhaps there are only so many per-doc stats, and
> this is only the second time we've needed to do this.
>
> It'd also be nice to store the wdfdocmax stats (and the uniqueterms
> stats) for all the documents in a chunked stream (like how document
> lengths are stored) - the code to work them out in this patch is correct,
> but requires scanning the termlist of each document we need this stat
> for, which is quite a lot of work.
New description:
Nishad Dawkhar implemented the "maxwdf" norm for `TfIdfWeight`, which is
on the [source:/@tfidf-maxwdf-norm tfidf-maxwdf-norm branch in git] now.

Because this changes the API of the Weight class (by adding a new
parameter to get_sumpart()) this can't be merged in 1.4.x. I think it's
better to hold off merging to master while these issues from before
remain:

* Remote backend support
* Given we pass doclen and uniqterms to get_sumextra(), it would make
sense to pass wdfdocmax to that too.

I'm not 100% happy with the way we seem to need to add new parameters to
get_sumpart() from time to time, because this means every Weight subclass
needs updating (fixing those in the library is OK, but this also affects
user-defined weighting schemes). I wonder if there's a clean and efficient
way to avoid this (it needs to be efficient as this method can get called
a lot). Or perhaps there are only so many per-doc stats, and this is only
the second time we've needed to do this.
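
One possible shape for this, as a rough sketch only: bundle the per-document statistics into a single struct passed by const reference, so that a new statistic becomes a new field rather than a new parameter, and subclasses which don't use it need no changes. The names `DocStats`, `SketchWeight`, `LengthNormTf` and `MaxWdfTf` are hypothetical and not part of Xapian's API, and the "maxwdf" formula used here (wdf divided by the largest wdf in the document) is an assumption for illustration.

```cpp
// Hypothetical bundle of per-document statistics (not a Xapian type).
struct DocStats {
    unsigned doclen = 0;     // document length
    unsigned uniqterms = 0;  // number of unique terms in the document
    unsigned wdfdocmax = 0;  // largest wdf of any term in the document
};

// Hypothetical base class: this signature stays the same when DocStats
// gains a field, so existing subclasses keep compiling unchanged.
class SketchWeight {
  public:
    virtual ~SketchWeight() = default;
    virtual double get_sumpart(unsigned wdf, const DocStats& stats) const = 0;
};

// Only reads doclen, so it is unaffected by the addition of wdfdocmax.
class LengthNormTf : public SketchWeight {
  public:
    double get_sumpart(unsigned wdf, const DocStats& stats) const override {
        return stats.doclen ? double(wdf) / stats.doclen : 0.0;
    }
};

// Assumed form of a "maxwdf" normalisation: wdf / wdfdocmax.
class MaxWdfTf : public SketchWeight {
  public:
    double get_sumpart(unsigned wdf, const DocStats& stats) const override {
        return stats.wdfdocmax ? double(wdf) / stats.wdfdocmax : 0.0;
    }
};
```

Passing a reference costs one pointer-sized argument per call however many stats exist, but whether this is fast enough for a method called this often would need measuring. It also means filling in the struct for every document, including stats a given scheme never reads.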

It'd also be nice to store the wdfdocmax stats (and the uniqueterms stats)
for all the documents in a chunked stream (like how document lengths are
stored) - the code to work them out in this patch is correct, but requires
scanning the termlist of each document we need this stat for, which is
quite a lot of work.
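
To illustrate why the scan is costly, here is a minimal sketch of working both stats out from a termlist. It uses a hypothetical in-memory stand-in (`TermList`, a vector of term/wdf pairs) rather than Xapian's own termlist classes, and `ScannedStats` and `scan_termlist` are illustrative names, not code from the branch.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a document's termlist: one (term, wdf) pair
// per distinct term in the document.
using TermList = std::vector<std::pair<std::string, unsigned>>;

// The two per-document stats which currently need a termlist scan.
struct ScannedStats {
    unsigned wdfdocmax = 0;  // largest wdf seen
    unsigned uniqterms = 0;  // number of entries, i.e. distinct terms
};

// Visits every entry once, so the cost is linear in the number of
// distinct terms, and is paid for each document needing the stat.
inline ScannedStats scan_termlist(const TermList& terms) {
    ScannedStats result;
    for (const auto& entry : terms) {
        result.wdfdocmax = std::max(result.wdfdocmax, entry.second);
        ++result.uniqterms;
    }
    return result;
}
```

With the values stored in a chunked stream alongside document lengths, this loop would be replaced by a lookup, at the cost of extra data to maintain at indexing time.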
--
Ticket URL: <https://trac.xapian.org/ticket/744#comment:1>
Xapian <https://xapian.org/>