<div dir="ltr">I was adding the calculation of a lower bound to get_sumpart() (DLH has no term-independent component) when I realized that the same lower bound would be recalculated for every term-document pair that get_sumpart() is called for, which hurts efficiency. How do I calculate the lower bound for a term only once and then reuse it?<br>
<br>-Regards<br>-Aarsh</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Jun 21, 2013 at 4:41 PM, Olly Betts <span dir="ltr"><<a href="mailto:olly@survex.com" target="_blank">olly@survex.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On Thu, Jun 20, 2013 at 05:10:30PM +0530, Aarsh Shah wrote:<br>
> Hello guys. I am currently working on the DLH weighting scheme. The formula<br>
> for DLH is very complex, and it ends up giving negative weights to some<br>
> documents. Because of this, in spite of containing one or more<br>
> occurrences of the keyword, the documents with negative weights<br>
> don't show up in the results at all. Please can I get some help on how to<br>
> deal with this? Or should I just leave it as it is and let the poor<br>
> documents suffer by virtue of them having statistics not suitable for DLH?<br>
<br>
</div></div>Xapian assumes each component of the weight sum is non-negative - if you<br>
return a negative component, the matcher optimisations will go wrong.<br>
<br>
If there's a lower bound, you might be able to address this by<br>
subtracting that bound and adjusting the term-independent component to<br>
compensate for the terms which don't match each document.<br>
<br>
E.g. if the weight contributed by query term t in doc d is W(t,d) and<br>
Wi(d) is the term independent component, then the weight for document d<br>
is:<br>
<br>
W_sum(d) = Sum{t in d}(W(t,d)) + Wi(d)<br>
<br>
If we have a lower bound for W(t,d):<br>
<br>
(a) W_low(t) <= W(t,d) for all d<br>
<br>
And it's negative (or if the weight for a given term is always >= 0,<br>
just make this lower bound zero):<br>
<br>
(b) W_low(t) <= 0<br>
<br>
And similarly for Wi(d):<br>
<br>
(c) Wi_low <= Wi(d) for all d<br>
<br>
And let's only adjust Wi if we have to:<br>
<br>
(d) Wi_low <= 0<br>
<br>
Then you can transform your weighting scheme to this one:<br>
<br>
W'(t,d) = W(t,d) - W_low(t)<br>
then: W'(t,d) >= 0 (from (a))<br>
<br>
Wi'(d) = Wi(d) - Wi_low - Sum{t not in d}(W_low(t))<br>
so: Wi'(d) >= Wi(d) - Wi_low (from (b))<br>
so: Wi'(d) >= 0 (from (c))<br>
<br>
And the total weight for document d is:<br>
<br>
W_sum'(d) = Sum{t in d}(W'(t,d)) + Wi'(d)<br>
= Sum{t in d}(W(t,d) - W_low(t)) + Wi(d) - Wi_low - Sum{t not in d}(W_low(t))<br>
= Sum{t in d}(W(t,d)) + Wi(d) - Wi_low - Sum{t}(W_low(t))<br>
= W_sum(d) - Wi_low - Sum{t}(W_low(t))<br>
<br>
So the transformation simply adds an offset to every weight, and that offset is constant for<br>
a given query on a given database, so the relative ordering of the weights<br>
is preserved.<br>
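<br>
The transformation can be checked numerically with a minimal sketch in plain Python. It does not use the Xapian API: the weights in RAW and WI, the terms, the document names, and all function names are invented for illustration. The lower bounds are taken as the minimum over the toy data, which is an assumption made to keep the example short; in a real weighting scheme the bound would have to be derived from the formula and the collection statistics. The bounds are computed once per term, before any document is scored, and then reused.<br>

```python
# Toy per-term weights W(t,d); a term that does not occur in a document
# has no entry. All numbers are made up for illustration.
RAW = {
    "apple": {"d1": -0.4, "d2": 0.9, "d3": 0.1},
    "pie": {"d1": 1.2, "d3": -0.7},
}
# Term-independent component Wi(d); zero everywhere, as for DLH.
WI = {"d1": 0.0, "d2": 0.0, "d3": 0.0}


def term_lower_bounds(raw):
    """Return W_low(t) per term, clamped to at most 0 (conditions (a), (b))."""
    return {t: min(0.0, min(per_doc.values())) for t, per_doc in raw.items()}


def original_sum(raw, wi, doc):
    """W_sum(d) = Sum{t in d}(W(t,d)) + Wi(d)."""
    return sum(w[doc] for w in raw.values() if doc in w) + wi[doc]


def shifted_sum(raw, wi, w_low, wi_low, doc):
    """W_sum'(d), built from the shifted components W'(t,d) and Wi'(d)."""
    term_part = sum(w[doc] - w_low[t] for t, w in raw.items() if doc in w)
    missing = sum(w_low[t] for t, w in raw.items() if doc not in w)
    extra = wi[doc] - wi_low - missing
    # Both parts of the shifted weight must be non-negative.
    assert term_part >= 0 and extra >= 0
    return term_part + extra


def ranking(scores):
    """Document names ordered from highest to lowest weight."""
    return sorted(scores, key=scores.get, reverse=True)


# Computed once per term (and once for Wi), before scoring any document.
W_LOW = term_lower_bounds(RAW)
WI_LOW = min(0.0, min(WI.values()))

ORIGINAL = {d: original_sum(RAW, WI, d) for d in WI}
SHIFTED = {d: shifted_sum(RAW, WI, W_LOW, WI_LOW, d) for d in WI}

print(ranking(ORIGINAL))
print(ranking(SHIFTED))
```

Both rankings come out in the same order, and each shifted weight differs from the original by the same offset, -Wi_low - Sum{t}(W_low(t)).<br>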
<br>
Cheers,<br>
Olly<br>
</blockquote></div><br></div>