[Xapian-tickets] [Xapian] #810: LMWeight implementation issues
Xapian
nobody at xapian.org
Thu Aug 12 01:18:13 BST 2021
#810: LMWeight implementation issues
--------------------------------+------------------------
Reporter: Olly Betts | Owner: Olly Betts
Type: defect | Status: new
Priority: normal | Milestone: 1.5.0
Component: Library API | Version: git master
Severity: normal | Keywords:
Blocked By: | Blocking:
Operating System: All |
--------------------------------+------------------------
I've just been implementing an improvement to always use the per-shard
document length bounds in weight calculations (previously we used the
global bounds for local shards, but the remote-specific bounds for remote
shards).
As part of this, I checked how the different weighting schemes use these
bounds in case any expected them to be the same for every shard - and
`LMWeight` does. That was already wrong with remote shards, but with my
change becomes wrong more often.
We could make the global bounds available to address this, but looking
more closely, I'm dubious about the implementation in more ways than just
this.
The weight score is a product of probabilities over terms in the query,
which we convert to a sum (as required by our weighting machinery) by
taking `log`. So we calculate `log(weight_document)` which is
`log(product(probability_term))` for terms in the query which is
`sum(log(probability_term))`.
There are two problems here.
One is that I'm fairly sure we're meant to sum over all terms in the
query, not just those that match the current document. Not doing this is
problematic because it means a document which matches fewer terms actually
gets a higher score (because probabilities are in the range [0,1]).
The other is that the log of a probability is always <= 0 and the term
weight contributions have to be >= 0 in Xapian.
To get around this we currently actually multiply each term's probability
by `log_param` before taking the log (which is equivalent to adding a
constant after the log). The problem here is that this effectively adds a
constant weight boost per matching term. That at least tends to
counteract the first issue, but overall we really aren't implementing the
weighting scheme we claim to be.
I notice that we could scale all document scores (in the original product
form) by dividing by the product over all query terms of the probability
of the term in the collection then for each document we can cancel the
contributions for terms not matching that document and after taking the
log we end up summing `log(probability_document/probability_collection)`
over terms matching the document.
I'd think probability_document (which unsmoothed is wdf/doclen and
smoothed will be somewhere between that and probability_collection) is
usually greater than probability_collection
(collection_freq/total_doc_length), so the result of this log will usually
be positive too. If it isn't, then clamping to zero seems reasonable (as
if this happens the term occurs in the document but less often than you'd
expect by random chance).
One thing we really need to do is track down exactly which paper(s) the
formulae used are from - a code comment only rather vaguely says "as
described by the early papers on LM by Bruce Croft". I think it must be
one which came after "A General Language Model for Information Retrieval"
(Song and Croft 1999) - that does have a matching basic framework but
doesn't describe all the smoothing options our implementation has, and it
also relies on modelling the term distribution (which we don't do).
Marking with 1.5.0, though I don't see this as a blocker - this isn't the
default weighting scheme and I think all of the issues noted above exist
in the original implementation from 2014.
--
Ticket URL: <https://trac.xapian.org/ticket/810>
Xapian <https://xapian.org/>
Xapian
More information about the Xapian-tickets
mailing list