[Xapian-tickets] [Xapian] #810: LMWeight implementation issues

Thu Aug 12 01:19:43 BST 2021

#810: LMWeight implementation issues
-------------------------+-------------------------------
 Reporter:  Olly Betts   |             Owner:  Olly Betts
     Type:  defect       |            Status:  new
 Priority:  normal       |         Milestone:  1.5.0
Component:  Library API  |           Version:  git master
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+-------------------------------
Description changed by Olly Betts:

Old description:

> I've just been implementing an improvement to always use the per-shard
> document length bounds in weight calculations (previously we used the
> global bounds for local shards, but the remote-specific bounds for remote
> shards).
>
> As part of this, I checked how the different weighting schemes use these
> bounds in case any expected them to be the same for every shard - and
> `LMWeight` does.  That was already wrong with remote shards, but with my
> change becomes wrong more often.
>
> We could make the global bounds available to address this, but looking
> more closely, I'm dubious about the implementation in more ways than just
> this.
>
> The weight score is a product of probabilities over terms in the query,
> which we convert to a sum (as required by our weighting machinery) by
> taking `log`.  So we calculate `log(weight_document)` which is
> `log(product(probability_term))` for terms in the query which is
> `sum(log(probability_term))`.
>
> There are two problems here.
>
> One is that I'm fairly sure we're meant to sum over all terms in the
> query, not just those that match the current document.  Not doing this is
> problematic because it means a document which matches fewer terms
> actually gets a higher score (because probabilities are in the range
> [0,1]).
>
> The other is that the log of a probability is always <= 0 and the term
> weight contributions have to be >= 0 in Xapian.
>
> To get around this we currently actually multiply each term's probability
> by `log_param` before taking the log (which is equivalent to adding a
> constant after the log).  The problem here is that this effectively adds
> a constant weight boost per matching term.  That at least tends to
> counteract the first issue, but overall we really aren't implementing the
> weighting scheme we claim to be.
>
> I notice that we could scale all document scores (in the original product
> form) by dividing by the product over all query terms of the probability
> of the term in the collection then for each document we can cancel the
> contributions for terms not matching that document and after taking the
> log we end up summing `log(probability_document/probability_collection)`
> over terms matching the document.
>
> I'd think probability_document (which unsmoothed is wdf/doclen and
> smoothed will be somewhere between that and probability_collection) is
> usually greater than probability_collection
> (collection_freq/total_doc_length), so the result of this log will
> usually be positive too.  If it isn't, then clamping to zero seems
> reasonable (as if this happens the term occurs in the document but less
> often than you'd expect by random chance).
>
> One thing we really need to do is track down exactly which paper(s) the
> formulae used are from - a code comment only rather vaguely says "as
> described by the early papers on LM by Bruce Croft".  I think it must be
> one which came after "A General Language Model for Information Retrieval"
> (Song and Croft 1999) - that does have a matching basic framework but
> doesn't describe all the smoothing options our implementation has, and it
> also relies on modelling the term distribution (which we don't do).
>
> Marking with 1.5.0, though I don't see this as a blocker - this isn't the
> default weighting scheme and I think all of the issues noted above exist
> in the original implementation from 2014.

New description:

 I've just been implementing an improvement to always use the per-shard
 document length bounds in weight calculations (previously we used the
 global bounds for local shards, but the remote-specific bounds for remote
 shards).

 As part of this, I checked how the different weighting schemes use these
 bounds in case any expected them to be the same for every shard - and
 `LMWeight` does.  That was already wrong with remote shards, but with my
 change becomes wrong more often.

 We could make the global bounds available to address this, but looking
 more closely, I'm dubious about the implementation in more ways than just
 this.

 The weight score is a product of probabilities over terms in the query,
 which we convert to a sum (as required by our weighting machinery) by
 taking `log`.  So we calculate `log(weight_document)` which is
 `log(product(probability_term))` for terms in the query which is
 `sum(log(probability_term))`.

 There are two problems here.

 One is that I'm fairly sure we're meant to sum over all terms in the
 query, not just those that match the current document.  Not doing this is
 problematic because it means a document which matches fewer terms actually
 gets a higher score (because probabilities are in the range [0, 1]).

 The other is that the log of a probability is always <= 0 and the term
 weight contributions have to be >= 0 in Xapian.

 To get around this we currently actually multiply each term's probability
 by `log_param` before taking the log (which is equivalent to adding a
 constant after the log).  The problem here is that this effectively adds a
 constant weight boost per matching term.  That at least tends to
 counteract the first issue, but overall we really aren't implementing the
 weighting scheme we claim to be.

 I notice that we could scale all document scores (in the original product
 form) by dividing by the product over all query terms of the probability
 of the term in the collection then for each document we can cancel the
 contributions for terms not matching that document and after taking the
 log we end up summing `log(probability_document/probability_collection)`
 over terms matching the document.

 I'd think probability_document (which unsmoothed is wdf/doclen and
 smoothed will be somewhere between that and probability_collection) is
 usually greater than probability_collection
 (collection_freq/total_doc_length), so the result of this log will usually
 be positive too.  If it isn't, then clamping to zero seems reasonable (as
 if this happens the term occurs in the document but less often than you'd
 expect by random chance).

 One thing we really need to do is track down exactly which paper(s) the
 formulae used are from - a code comment only rather vaguely says "as
 described by the early papers on LM by Bruce Croft".  I think it must be
 one which came after "A General Language Model for Information Retrieval"
 (Song and Croft 1999) - that does have a matching basic framework but
 doesn't describe all the smoothing options our implementation has, and it
 also relies on modelling the term distribution (which we don't do).

 Marking with 1.5.0, though I don't see this as a blocker - this isn't the
 default weighting scheme and I think all of the issues noted above exist
 in the original implementation from 2014.

--
-- 
Ticket URL: <https://trac.xapian.org/ticket/810#comment:1>
Xapian <https://xapian.org/>
Xapian