[Xapian-tickets] [Xapian] #810: LMWeight implementation issues
Xapian
nobody at xapian.org
Thu Aug 12 01:19:43 BST 2021
#810: LMWeight implementation issues
-------------------------+-------------------------------
Reporter: Olly Betts | Owner: Olly Betts
Type: defect | Status: new
Priority: normal | Milestone: 1.5.0
Component: Library API | Version: git master
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
-------------------------+-------------------------------
Description changed by Olly Betts:
Old description:
> I've just been implementing an improvement to always use the per-shard
> document length bounds in weight calculations (previously we used the
> global bounds for local shards, but the remote-specific bounds for remote
> shards).
>
> As part of this, I checked how the different weighting schemes use these
> bounds in case any expected them to be the same for every shard - and
> `LMWeight` does. That was already wrong with remote shards, but with my
> change becomes wrong more often.
>
> We could make the global bounds available to address this, but looking
> more closely, I'm dubious about the implementation in more ways than just
> this.
>
> The weight score is a product of probabilities over terms in the query,
> which we convert to a sum (as required by our weighting machinery) by
> taking `log`. So we calculate `log(weight_document)` which is
> `log(product(probability_term))` for terms in the query which is
> `sum(log(probability_term))`.
>
> There are two problems here.
>
> One is that I'm fairly sure we're meant to sum over all terms in the
> query, not just those that match the current document. Not doing this is
> problematic because it means a document which matches fewer terms
> actually gets a higher score (because probabilities are in the range
> [0,1]).
>
> The other is that the log of a probability is always <= 0 and the term
> weight contributions have to be >= 0 in Xapian.
>
> To get around this we currently actually multiply each term's probability
> by `log_param` before taking the log (which is equivalent to adding a
> constant after the log). The problem here is that this effectively adds
> a constant weight boost per matching term. That at least tends to
> counteract the first issue, but overall we really aren't implementing the
> weighting scheme we claim to be.
>
> I notice that we could scale all document scores (in the original product
> form) by dividing by the product over all query terms of the probability
> of the term in the collection then for each document we can cancel the
> contributions for terms not matching that document and after taking the
> log we end up summing `log(probability_document/probability_collection)`
> over terms matching the document.
>
> I'd think probability_document (which unsmoothed is wdf/doclen and
> smoothed will be somewhere between that and probability_collection) is
> usually greater than probability_collection
> (collection_freq/total_doc_length), so the result of this log will
> usually be positive too. If it isn't, then clamping to zero seems
> reasonable (as if this happens the term occurs in the document but less
> often than you'd expect by random chance).
>
> One thing we really need to do is track down exactly which paper(s) the
> formulae used are from - a code comment only rather vaguely says "as
> described by the early papers on LM by Bruce Croft". I think it must be
> one which came after "A General Language Model for Information Retrieval"
> (Song and Croft 1999) - that does have a matching basic framework but
> doesn't describe all the smoothing options our implementation has, and it
> also relies on modelling the term distribution (which we don't do).
>
> Marking with 1.5.0, though I don't see this as a blocker - this isn't the
> default weighting scheme and I think all of the issues noted above exist
> in the original implementation from 2014.
New description:
I've just been implementing an improvement to always use the per-shard
document length bounds in weight calculations (previously we used the
global bounds for local shards, but the remote-specific bounds for remote
shards).
As part of this, I checked how the different weighting schemes use these
bounds in case any expected them to be the same for every shard - and
`LMWeight` does. That was already wrong with remote shards, but with my
change becomes wrong more often.
We could make the global bounds available to address this, but looking
more closely, I'm dubious about the implementation in more ways than just
this.
The weight score is a product of probabilities over terms in the query,
which we convert to a sum (as required by our weighting machinery) by
taking `log`. So we calculate `log(weight_document)` which is
`log(product(probability_term))` for terms in the query which is
`sum(log(probability_term))`.
There are two problems here.
One is that I'm fairly sure we're meant to sum over all terms in the
query, not just those that match the current document. Not doing this is
problematic because it means a document which matches fewer terms actually
gets a higher score (because probabilities are in the range [0, 1]).
The other is that the log of a probability is always <= 0 and the term
weight contributions have to be >= 0 in Xapian.
To get around this we currently actually multiply each term's probability
by `log_param` before taking the log (which is equivalent to adding a
constant after the log). The problem here is that this effectively adds a
constant weight boost per matching term. That at least tends to
counteract the first issue, but overall we really aren't implementing the
weighting scheme we claim to be.
I notice that we could scale all document scores (in the original product
form) by dividing by the product over all query terms of the probability
of the term in the collection then for each document we can cancel the
contributions for terms not matching that document and after taking the
log we end up summing `log(probability_document/probability_collection)`
over terms matching the document.
I'd think probability_document (which unsmoothed is wdf/doclen and
smoothed will be somewhere between that and probability_collection) is
usually greater than probability_collection
(collection_freq/total_doc_length), so the result of this log will usually
be positive too. If it isn't, then clamping to zero seems reasonable (as
if this happens the term occurs in the document but less often than you'd
expect by random chance).
One thing we really need to do is track down exactly which paper(s) the
formulae used are from - a code comment only rather vaguely says "as
described by the early papers on LM by Bruce Croft". I think it must be
one which came after "A General Language Model for Information Retrieval"
(Song and Croft 1999) - that does have a matching basic framework but
doesn't describe all the smoothing options our implementation has, and it
also relies on modelling the term distribution (which we don't do).
Marking with 1.5.0, though I don't see this as a blocker - this isn't the
default weighting scheme and I think all of the issues noted above exist
in the original implementation from 2014.
--
--
Ticket URL: <https://trac.xapian.org/ticket/810#comment:1>
Xapian <https://xapian.org/>
Xapian
More information about the Xapian-tickets
mailing list