<div dir="ltr">I was adding the calculation of a lower bound to get_sumpart() (DLH has no term-independent component) when I realized that the same lower bound would be recalculated for every term-document pair that get_sumpart() is called for, which is inefficient. How do I calculate the lower bound for a term only once and then reuse it?<br>

<br>-Regards<br>-Aarsh</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Jun 21, 2013 at 4:41 PM, Olly Betts <span dir="ltr"><<a href="mailto:olly@survex.com" target="_blank">olly@survex.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On Thu, Jun 20, 2013 at 05:10:30PM +0530, Aarsh Shah wrote:<br>

> Hello guys. I am currently working on the DLH weighting scheme. The formula<br>

> for DLH is very complex and it ends up giving negative weights to some<br>

> documents. Because of this, in spite of containing one or more<br>

> occurrences of the keyword, the documents with negative weights<br>

> don't show up in the results at all. Can I get some help on how to<br>

> deal with this? Or should I just leave it as it is and let the poor<br>

> documents suffer by virtue of them having statistics not suitable for DLH?<br>

<br>

</div></div>Xapian assumes each component of the weight sum is positive - if you<br>

return a negative component, the matcher optimisations will go wrong.<br>

<br>

If there's a lower bound, you might be able to address this by<br>

subtracting that bound and adjusting the term-independent component to<br>

compensate for the terms which don't match each document.<br>

<br>

E.g. if the weight contributed by query term t in doc d is W(t,d) and<br>

Wi(d) is the term independent component, then the weight for document d<br>

is:<br>

<br>

  W_sum(d) = Sum{t in d}(W(t,d)) + Wi(d)<br>

<br>

If we have a lower bound for W(t,d):<br>

<br>

  (a) W_low(t) <= W(t,d) for all d<br>

<br>

And it's negative (or if the weight for a given term is always >= 0,<br>

just make this lower bound zero):<br>

<br>

  (b) W_low(t) <= 0<br>

<br>

And similarly for Wi(d):<br>

<br>

  (c) Wi_low <= Wi(d) for all d<br>

<br>

And let's only adjust Wi if we have to:<br>

<br>

  (d) Wi_low <= 0<br>

<br>

Then you can transform your weighting scheme to this one:<br>

<br>

  W'(t,d) = W(t,d) - W_low(t)<br>

  then:  W'(t,d) >= 0  (from (a))<br>

<br>

  Wi'(d) = Wi(d) - Wi_low - Sum{t not in d}(W_low(t))<br>

  so:  Wi'(d) >= Wi(d) - Wi_low  (from (b))<br>

  so:  Wi'(d) >= 0  (from (c))<br>
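<br>

Because W_low(t) depends only on the term and not on the document, it can be<br>
worked out once per term and reused for every document. A minimal sketch of<br>
that idea follows. It is a toy, not the Xapian API: the class ShiftedTermWeight,<br>
the function toy_weight and the brute-force minimum over candidate_stats are all<br>
assumptions made for illustration, and a real scheme would presumably derive the<br>
bound analytically from the collection statistics during its per-term setup.<br>

<pre>
```python
import math


class ShiftedTermWeight:
    """Toy per-term weight object (hypothetical, not the Xapian API)."""

    def __init__(self, raw_weight, candidate_stats):
        # raw_weight(wdf, doclen) stands in for the unshifted W(t,d).
        self.raw_weight = raw_weight
        # Computed once per term: the lowest value W(t,d) takes over the
        # candidate statistics, capped at 0 so that condition (b) holds.
        lowest = min(raw_weight(wdf, doclen) for wdf, doclen in candidate_stats)
        self.w_low = min(0.0, lowest)

    def get_sumpart(self, wdf, doclen):
        # Called once per term-document pair: only reuses the cached bound.
        return self.raw_weight(wdf, doclen) - self.w_low


def toy_weight(wdf, doclen):
    # Made-up stand-in for W(t,d); it goes negative when wdf / doclen is small.
    return wdf * math.log2(wdf * 100.0 / doclen)


# (wdf, doclen) pairs the term could see; made-up numbers.
stats = [(1, 400), (2, 100), (4, 100), (1, 100)]
term = ShiftedTermWeight(toy_weight, stats)
shifted = [term.get_sumpart(wdf, doclen) for wdf, doclen in stats]
```
</pre>

Here the bound is found in the constructor, so get_sumpart() does a single<br>
subtraction per call however many documents are scored.<br>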

<br>

And the total weight for document d is:<br>

<br>

  W_sum'(d) = Sum{t in d}(W'(t,d)) + Wi'(d)<br>

    = Sum{t in d}(W(t,d) - W_low(t)) + Wi(d) - Wi_low - Sum{t not in d}(W_low(t))<br>

    = Sum{t in d}(W(t,d)) + Wi(d) - Wi_low - Sum{t}(W_low(t))<br>

    = W_sum(d) - Wi_low - Sum{t}(W_low(t))<br>

<br>

So the transformation simply adds something to every weight which is constant for<br>

a given query on a given database, so the relative ordering of the weights<br>

is preserved.<br>
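<br>

The derivation above can be checked numerically. The sketch below applies the<br>
shift to a handful of made-up weights and compares the old and new totals. The<br>
function shift_weights, the document and term names and all the numbers are<br>
assumptions chosen for illustration, not values from DLH or from Xapian.<br>

<pre>
```python
def shift_weights(term_weights, wi, w_low, wi_low):
    """Apply the shift to every document.

    term_weights[d] maps each query term matching document d to W(t,d);
    wi[d] is Wi(d); w_low[t] and wi_low are the non-positive lower bounds.
    Returns the shifted per-term weights and the shifted Wi per document.
    """
    shifted_terms, shifted_wi = {}, {}
    for d, weights in term_weights.items():
        # W'(t,d) = W(t,d) - W_low(t)
        shifted_terms[d] = {t: w - w_low[t] for t, w in weights.items()}
        # Sum of W_low(t) over the query terms which do not match d.
        missing = sum(low for t, low in w_low.items() if t not in weights)
        # Wi'(d) = Wi(d) - Wi_low - Sum{t not in d}(W_low(t))
        shifted_wi[d] = wi[d] - wi_low - missing
    return shifted_terms, shifted_wi


# Made-up numbers for two query terms and three documents.
term_weights = {
    "d1": {"apple": -1.5, "pie": 2.0},
    "d2": {"apple": 0.5},
    "d3": {"pie": -0.25},
}
wi = {"d1": 0.0, "d2": -0.5, "d3": 1.0}
w_low = {"apple": -1.5, "pie": -0.25}
wi_low = -0.5

shifted_terms, shifted_wi = shift_weights(term_weights, wi, w_low, wi_low)
old_sum = {d: sum(term_weights[d].values()) + wi[d] for d in wi}
new_sum = {d: sum(shifted_terms[d].values()) + shifted_wi[d] for d in wi}
# The constant added to every document's total weight.
offset = -wi_low - sum(w_low.values())
```
</pre>

With these numbers every shifted component is non-negative and each total in<br>
new_sum is the matching total in old_sum plus the same offset, so the documents<br>
rank in the same order before and after the shift.<br>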

<br>

Cheers,<br>

    Olly<br>

</blockquote></div><br></div>