[Xapian-tickets] [Xapian] #744: Merge tfidf-maxwdf-norm branch
Xapian
nobody at xapian.org
Wed Dec 7 06:11:31 GMT 2016
#744: Merge tfidf-maxwdf-norm branch
--------------------------------+-------------------
        Reporter:  olly         |     Owner:  olly
            Type:  defect       |    Status:  new
        Priority:  normal       | Milestone:  1.5.0
       Component:  Library API  |   Version:
        Severity:  normal       |  Keywords:
      Blocked By:               |  Blocking:
Operating System:  All          |
--------------------------------+-------------------
Nishad Dawkhar implemented the "maxwdf" norm for `TfIdfWeight`, which is
on the tfidf-maxwdf-norm branch in git now.
Because this changes the API of the Weight class (by adding a new
parameter to get_sumpart()), this can't be merged in 1.4.x. I think it's
better to hold off merging to master while these previously raised issues
remain:
* Remote backend support
* Given we pass doclen and uniqterms to get_sumextra(), it would make
  sense to pass wdfdocmax to that too.
I'm not 100% happy with the way we seem to need to add new parameters to
get_sumpart() from time to time, because this means every Weight subclass
needs updating (fixing those in the library is OK, but this also affects
user-defined weighting schemes). I wonder if there's a clean and efficient
way to avoid this (it needs to be efficient as this method can get called
a lot). Or perhaps it isn't worth worrying about: there are only so many
per-doc stats, and this is only the second time we've needed to do this.
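
One possible shape for such a change, sketched as standalone C++ rather
than against the real Xapian headers: bundle the per-document statistics
into a small struct and pass it by const reference, so a later statistic
extends the struct instead of changing the virtual method's signature.
`DocStats`, `SketchWeight` and `MaxWdfTf` are hypothetical names, and the
`wdf / wdfdocmax` formula is assumed here as the usual "max" term-frequency
normalisation, not taken from the branch.

```cpp
#include <cassert>

// Hypothetical bundle of per-document statistics (not part of Xapian's API).
// A new statistic becomes a new member with a default value, so existing
// subclasses keep compiling without source changes.
struct DocStats {
    unsigned wdf = 0;        // within-document frequency of the term
    unsigned doclen = 0;     // document length
    unsigned uniqterms = 0;  // number of unique terms in the document
    unsigned wdfdocmax = 0;  // largest wdf of any term in the document
};

// Stand-in for a Weight-like base class; the virtual signature stays fixed.
class SketchWeight {
  public:
    virtual ~SketchWeight() = default;
    virtual double get_sumpart(const DocStats& stats) const = 0;
};

// Toy scheme using an assumed "max" normalisation: wdf / wdfdocmax.
class MaxWdfTf : public SketchWeight {
  public:
    double get_sumpart(const DocStats& stats) const override {
        if (stats.wdfdocmax == 0) return 0.0;  // no terms: nothing to score
        assert(stats.wdf <= stats.wdfdocmax);  // wdf cannot exceed the max
        return static_cast<double>(stats.wdf) / stats.wdfdocmax;
    }
};
```

With this layout a caller would fill in one `DocStats` per document and
reuse it for each term. Adding a member still changes the struct's size, so
it would remain an ABI change, and whether passing a reference is as cheap
as passing a few integers directly would need measuring.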
It'd also be nice to store the wdfdocmax stats (and the uniqterms stats)
for all the documents in a chunked stream (like how document lengths are
stored) - the code to work them out in this patch is correct, but requires
scanning the termlist of each document we need this stat for, which is
quite a lot of work.
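
To make the cost difference concrete, here is a minimal standalone C++
sketch, not code from the branch: `wdfdocmax_by_scan` walks one document's
termlist each time the stat is needed, while `wdfdocmax_stored` reads a
value from a per-document array computed ahead of time. The `TermEntry`
type, both function names and the flat `std::vector` layout are assumptions
for illustration; a real chunked on-disk encoding is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory termlist entry: a term and its wdf in one document.
struct TermEntry {
    std::string term;
    unsigned wdf;
};

// Scanning approach: cost is proportional to the number of terms in the
// document, and is paid every time the statistic is requested.
unsigned wdfdocmax_by_scan(const std::vector<TermEntry>& termlist) {
    unsigned max_wdf = 0;
    for (const TermEntry& e : termlist) {
        max_wdf = std::max(max_wdf, e.wdf);
    }
    return max_wdf;
}

// Stored approach: values were computed once (for example at index time)
// and are looked up by document position, much as document lengths can be.
unsigned wdfdocmax_stored(const std::vector<unsigned>& per_doc_max,
                          std::size_t doc_index) {
    assert(doc_index < per_doc_max.size());
    return per_doc_max[doc_index];
}
```

The scan is linear in the document's term count on every request, whereas
the stored lookup is a single indexed read once the values exist.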
--
Ticket URL: <https://trac.xapian.org/ticket/744>
Xapian <https://xapian.org/>