[Xapian-tickets] [Xapian] #744: Merge tfidf-maxwdf-norm branch

Xapian nobody at xapian.org
Wed Dec 7 06:11:31 GMT 2016


#744: Merge tfidf-maxwdf-norm branch
--------------------------------+-------------------
        Reporter:  olly         |      Owner:  olly
            Type:  defect       |     Status:  new
        Priority:  normal       |  Milestone:  1.5.0
       Component:  Library API  |    Version:
        Severity:  normal       |   Keywords:
      Blocked By:               |   Blocking:
Operating System:  All          |
--------------------------------+-------------------
 Nishad Dawkhar implemented the "maxwdf" norm for `TfIdfWeight`, which is
 on the tfidf-maxwdf-norm branch in git now.

 Because this changes the API of the Weight class (by adding a new
 parameter to get_sumpart()), it can't be merged into 1.4.x. I think it's
 better to hold off merging to master while these previously raised
 issues remain open:

  * Remote backend support
  * Given we pass doclen and uniqterms to get_sumextra(), it would make
    sense to pass wdfdocmax to that too.

 I'm not 100% happy with the way we seem to need to add new parameters to
 get_sumpart() from time to time, because this means every Weight subclass
 needs updating (fixing those in the library is OK, but this also affects
 user-defined weighting schemes). I wonder if there's a clean and efficient
 way to avoid this (it needs to be efficient as this method can get called
 a lot). Or perhaps there are only so many per-doc stats, and this is only
 the second time we've needed to do this.
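
 One possible shape for this, as a rough sketch only: bundle the per-doc
 stats into a struct passed by const reference, so that adding a stat
 later does not change the method signature. DocStats, SketchWeight and
 MaxWdfNormWeight are hypothetical names, not part of the Xapian API, and
 the norm shown (wdf divided by wdfdocmax) is an assumption about what the
 branch computes.

```cpp
#include <cstdint>

// Hypothetical bundle of per-document statistics (not Xapian's API).
struct DocStats {
    std::uint32_t doclen = 0;     // document length
    std::uint32_t uniqterms = 0;  // number of unique terms in the document
    std::uint32_t wdfdocmax = 0;  // largest wdf of any term in the document
};

// Hypothetical stand-in for the Weight base class.
class SketchWeight {
  public:
    virtual ~SketchWeight() = default;
    // A new field in DocStats leaves this signature untouched, so
    // existing subclasses keep compiling without source changes.
    virtual double get_sumpart(std::uint32_t wdf,
                               const DocStats& stats) const = 0;
};

// A user-defined scheme that reads only the stats it needs.
// Assumed norm: wdf / wdfdocmax, with 0 for a document with no terms.
class MaxWdfNormWeight : public SketchWeight {
  public:
    double get_sumpart(std::uint32_t wdf,
                       const DocStats& stats) const override {
        if (stats.wdfdocmax == 0) return 0.0;
        return static_cast<double>(wdf) / stats.wdfdocmax;
    }
};
```

 This only addresses source compatibility: growing the struct would still
 be an ABI change, and the extra indirection would need measuring given
 how often the method is called.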

 It'd also be nice to store the wdfdocmax stats (and the uniqterms stats)
 for all the documents in a chunked stream (like how document lengths are
 stored) - the code to work them out in this patch is correct, but requires
 scanning the termlist of each document we need this stat for, which is
 quite a lot of work.
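
 To illustrate the cost difference, here is a minimal in-memory sketch.
 TermList, scan_wdfdocmax() and build_wdfdocmax_table() are hypothetical
 names, and the plain vector only stands in for an on-disk chunked stream;
 it is not how the backend stores anything.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a document's termlist: (term, wdf) pairs.
using TermList = std::vector<std::pair<std::string, std::uint32_t>>;

// Work out wdfdocmax by scanning every entry in the termlist.
// Cost is linear in the number of unique terms in the document.
std::uint32_t scan_wdfdocmax(const TermList& termlist) {
    std::uint32_t max_wdf = 0;
    for (const auto& entry : termlist) {
        max_wdf = std::max(max_wdf, entry.second);
    }
    return max_wdf;
}

// Compute the stat once per document (e.g. at indexing time) and keep
// the results in a table indexed by document position, so that a later
// lookup is a single array access instead of a termlist scan.
std::vector<std::uint32_t>
build_wdfdocmax_table(const std::vector<TermList>& docs) {
    std::vector<std::uint32_t> table;
    table.reserve(docs.size());
    for (const auto& termlist : docs) {
        table.push_back(scan_wdfdocmax(termlist));
    }
    return table;
}
```

 With a stored table, the per-document scan happens once when the document
 is added rather than every time the stat is needed during matching.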

--
Ticket URL: <https://trac.xapian.org/ticket/744>
Xapian <https://xapian.org/>