[Xapian-devel] Dealing with negative weights
Olly Betts
olly at survex.com
Fri Jun 21 12:11:54 BST 2013
On Thu, Jun 20, 2013 at 05:10:30PM +0530, Aarsh Shah wrote:
> Hello guys. I am currently working on the DLH weighting scheme. The formula
> for DLH is very complex and it ends up giving negative weights to some
> documents. Because of this, in spite of containing one or more
> occurrences of the keyword, the documents with negative weights
> don't show up in the results at all. Please can I get some help on how to
> deal with this? Or should I just leave it as it is and let the poor
> documents suffer by virtue of them having statistics not suitable for DLH?
Xapian assumes each component of the weight sum is non-negative - if you
return a negative component, the matcher optimisations will go wrong.
If there's a lower bound, you might be able to address this by
subtracting that bound and adjusting the term-independent component to
compensate for the terms which don't match each document.
E.g. if the weight contributed by query term t in doc d is W(t,d) and
Wi(d) is the term independent component, then the weight for document d
is:
W_sum(d) = Sum{t in d}(W(t,d)) + Wi(d)
If we have a lower bound for W(t,d):
(a) W_low(t) <= W(t,d) for all d
And it's negative (or if the weight for a given term is always >= 0,
just make this lower bound zero):
(b) W_low(t) <= 0
And similarly for Wi(d):
(c) Wi_low <= Wi(d) for all d
And let's only adjust Wi if we have to:
(d) Wi_low <= 0
Then you can transform your weighting scheme to this one:
W'(t,d) = W(t,d) - W_low(t)
then: W'(t,d) >= 0 (from (a))
Wi'(d) = Wi(d) - Wi_low - Sum{t not in d}(W_low(t))
so: Wi'(d) >= Wi(d) - Wi_low (from (b))
so: Wi'(d) >= 0 (from (c))
And the total weight for document d is:
W_sum'(d) = Sum{t in d}(W'(t,d)) + Wi'(d)
          = Sum{t in d}(W(t,d) - W_low(t)) + Wi(d) - Wi_low - Sum{t not in d}(W_low(t))
          = Sum{t in d}(W(t,d)) + Wi(d) - Wi_low - Sum{t}(W_low(t))
          = W_sum(d) - Wi_low - Sum{t}(W_low(t))
So all this does is add a quantity to every weight which is constant for
a given query on a given database (here Sum{t} runs over all the query
terms), so the relative ordering of the weights is preserved.
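
To make the transformation concrete, here is a rough Python sketch on toy
numbers. Everything in it is made up for illustration: the documents, the
weights, and the way the lower bounds are obtained (taken as the minimum over
this tiny sample and clamped to <= 0). A real weighting scheme would need
bounds derived from the formula and the database statistics so that they hold
for every document, and this is not Xapian code.

```python
# Hypothetical toy data: W[d][t] is the weight W(t,d) for each query term t
# that occurs in document d; Wi[d] is the term-independent component Wi(d).
query_terms = ["a", "b", "c"]
W = {
    "doc1": {"a": -0.5, "b": 1.2},
    "doc2": {"b": -0.3, "c": 0.4},
    "doc3": {"a": 0.1, "b": 0.2, "c": -1.0},
}
Wi = {"doc1": -0.2, "doc2": 0.3, "doc3": 0.0}

# Lower bounds satisfying (a)-(d).  Assumption: we simply take the minimum
# seen in the sample and clamp it so it is never above zero.
W_low = {
    t: min(0.0, min(w[t] for w in W.values() if t in w)) for t in query_terms
}
Wi_low = min(0.0, min(Wi.values()))


def w_sum(d):
    """Original total weight: Sum{t in d}(W(t,d)) + Wi(d)."""
    return sum(W[d].values()) + Wi[d]


def w_prime(t, d):
    """Shifted per-term weight: W'(t,d) = W(t,d) - W_low(t)."""
    return W[d][t] - W_low[t]


def wi_prime(d):
    """Shifted term-independent part, compensating for non-matching terms."""
    missing = sum(W_low[t] for t in query_terms if t not in W[d])
    return Wi[d] - Wi_low - missing


def w_sum_prime(d):
    """Transformed total weight: Sum{t in d}(W'(t,d)) + Wi'(d)."""
    return sum(w_prime(t, d) for t in W[d]) + wi_prime(d)


# The constant added to every document's weight for this query.
shift = -Wi_low - sum(W_low.values())

for d in W:
    print(d, w_sum(d), w_sum_prime(d))
```

Every component of the transformed scheme comes out >= 0, each total differs
from the original by the same shift, and sorting the documents by either
total gives the same order.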
Cheers,
Olly