[Xapian-devel] Dealing with negative weights

Olly Betts olly at survex.com
Fri Jun 21 12:11:54 BST 2013


On Thu, Jun 20, 2013 at 05:10:30PM +0530, Aarsh Shah wrote:
> Hello guys. I am currently working on the DLH weighting scheme. The formula
> for DLH is very complex and it ends up giving negative weights to some
> documents. Because of this, in spite of containing one or more
> occurrences of the keyword, the documents with negative weights
> don't show up in the results at all. Please can I get some help on how to
> deal with this? Or should I just leave it as it is and let the poor
> documents suffer by virtue of them having statistics not suitable for DLH?

Xapian assumes each component of the weight sum is non-negative - if you
return a negative component, the matcher optimisations will go wrong.

If there's a lower bound, you might be able to address this by
subtracting that bound and adjusting the term-independent component to
compensate for the terms which don't match each document.

E.g. if the weight contributed by query term t in doc d is W(t,d) and
Wi(d) is the term independent component, then the weight for document d
is:

  W_sum(d) = Sum{t in d}(W(t,d)) + Wi(d)

If we have a lower bound for W(t,d):

  (a) W_low(t) <= W(t,d) for all d

And it's negative (or if the weight for a given term is always >= 0,
just make this lower bound zero):

  (b) W_low(t) <= 0

And similarly for Wi(d):

  (c) Wi_low <= Wi(d) for all d

And let's only adjust Wi if we have to:

  (d) Wi_low <= 0

Then you can transform your weighting scheme to this one:

  W'(t,d) = W(t,d) - W_low(t)
  then:  W'(t,d) >= 0  (from (a))

  Wi'(d) = Wi(d) - Wi_low - Sum{t not in d}(W_low(t))
  so:  Wi'(d) >= Wi(d) - Wi_low  (from (b))
  so:  Wi'(d) >= 0  (from (c))

And the total weight for document d is:

  W_sum'(d) = Sum{t in d}(W'(t,d)) + Wi'(d)
    = Sum{t in d}(W(t,d) - W_low(t)) + Wi(d) - Wi_low - Sum{t not in d}(W_low(t))
    = Sum{t in d}(W(t,d)) + Wi(d) - Wi_low - Sum{t}(W_low(t))
    = W_sum(d) - Wi_low - Sum{t}(W_low(t))

So the transformation simply adds the same amount, -Wi_low - Sum{t}(W_low(t)),
to every document's weight.  That amount is constant for a given query on a
given database, so the relative ordering of the weights is preserved.
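
If it helps to see the algebra with numbers, here is a minimal Python
sketch of the transformation.  It isn't Xapian code: the documents, raw
weights and lower bounds are made up for illustration, and the function
names (raw_sum, shifted_term, shifted_independent, shifted_sum) are
hypothetical.  It only assumes bounds satisfying (a) to (d).

```python
# Made-up toy data: raw per-term weights W(t,d) for the query terms each
# document actually contains, plus a term-independent part Wi(d).
W = {
    "doc1": {"apple": -0.4, "pie": 1.2},
    "doc2": {"apple": 0.3},
    "doc3": {"pie": -0.1},
}
Wi = {"doc1": -0.2, "doc2": 0.5, "doc3": -0.6}
query_terms = ["apple", "pie"]

# Assumed lower bounds satisfying (a)-(d): each is <= the raw values
# above and also <= 0.
W_low = {"apple": -0.5, "pie": -0.25}
Wi_low = -1.0


def raw_sum(doc):
    """W_sum(d) = Sum{t in d}(W(t,d)) + Wi(d)."""
    return sum(W[doc].values()) + Wi[doc]


def shifted_term(term, doc):
    """W'(t,d) = W(t,d) - W_low(t), which is >= 0 by (a)."""
    return W[doc][term] - W_low[term]


def shifted_independent(doc):
    """Wi'(d) = Wi(d) - Wi_low - Sum{t not in d}(W_low(t))."""
    missing = sum(W_low[t] for t in query_terms if t not in W[doc])
    return Wi[doc] - Wi_low - missing


def shifted_sum(doc):
    """W_sum'(d) = Sum{t in d}(W'(t,d)) + Wi'(d)."""
    return sum(shifted_term(t, doc) for t in W[doc]) + shifted_independent(doc)


# The per-query constant that ends up added to every document's weight.
shift = -Wi_low - sum(W_low[t] for t in query_terms)

for doc in W:
    print(doc, raw_sum(doc), shifted_sum(doc), shifted_sum(doc) - raw_sum(doc))
```

For every document the last printed column should equal shift (up to
floating point rounding), and sorting the documents by raw_sum or by
shifted_sum gives the same order.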

Cheers,
    Olly


