[Xapian-tickets] [Xapian] #810: LMWeight implementation issues

Thu Aug 12 01:18:13 BST 2021

#810: LMWeight implementation issues
--------------------------------+------------------------
        Reporter:  Olly Betts   |      Owner:  Olly Betts
            Type:  defect       |     Status:  new
        Priority:  normal       |  Milestone:  1.5.0
       Component:  Library API  |    Version:  git master
        Severity:  normal       |   Keywords:
      Blocked By:               |   Blocking:
Operating System:  All          |
--------------------------------+------------------------
 I've just been implementing an improvement to always use the per-shard
 document length bounds in weight calculations (previously we used the
 global bounds for local shards, but the remote-specific bounds for remote
 shards).

 As part of this, I checked how the different weighting schemes use these
 bounds in case any expected them to be the same for every shard - and
 `LMWeight` does.  That was already wrong with remote shards, but with my
 change becomes wrong more often.

 We could make the global bounds available to address this, but looking
 more closely, I'm dubious about the implementation in more ways than just
 this.

 The weight score is a product of probabilities over terms in the query,
 which we convert to a sum (as required by our weighting machinery) by
 taking `log`.  So we calculate `log(weight_document)` which is
 `log(product(probability_term))` for terms in the query which is
 `sum(log(probability_term))`.

 There are two problems here.

 One is that I'm fairly sure we're meant to sum over all terms in the
 query, not just those that match the current document.  Not doing this is
 problematic because it means a document which matches fewer terms actually
 gets a higher score (because probabilities are in the range [0,1]).

 The other is that the log of a probability is always <= 0 and the term
 weight contributions have to be >= 0 in Xapian.

 To get around this we currently actually multiply each term's probability
 by `log_param` before taking the log (which is equivalent to adding a
 constant after the log).  The problem here is that this effectively adds a
 constant weight boost per matching term.  That at least tends to
 counteract the first issue, but overall we really aren't implementing the
 weighting scheme we claim to be.

 I notice that we could scale all document scores (in the original product
 form) by dividing by the product over all query terms of the probability
 of the term in the collection then for each document we can cancel the
 contributions for terms not matching that document and after taking the
 log we end up summing `log(probability_document/probability_collection)`
 over terms matching the document.

 I'd think probability_document (which unsmoothed is wdf/doclen and
 smoothed will be somewhere between that and probability_collection) is
 usually greater than probability_collection
 (collection_freq/total_doc_length), so the result of this log will usually
 be positive too.  If it isn't, then clamping to zero seems reasonable (as
 if this happens the term occurs in the document but less often than you'd
 expect by random chance).

 One thing we really need to do is track down exactly which paper(s) the
 formulae used are from - a code comment only rather vaguely says "as
 described by the early papers on LM by Bruce Croft".  I think it must be
 one which came after "A General Language Model for Information Retrieval"
 (Song and Croft 1999) - that does have a matching basic framework but
 doesn't describe all the smoothing options our implementation has, and it
 also relies on modelling the term distribution (which we don't do).

 Marking with 1.5.0, though I don't see this as a blocker - this isn't the
 default weighting scheme and I think all of the issues noted above exist
 in the original implementation from 2014.
-- 
Ticket URL: <https://trac.xapian.org/ticket/810>
Xapian <https://xapian.org/>
Xapian