[Xapian-devel] Dealing with negative weights
Aarsh Shah
aarshkshah1992 at gmail.com
Sat Jun 22 08:11:52 BST 2013
I was adding the calculation of a lower bound to get_sumpart() (DLH has
no term-independent component) when I realized that the same lower bound
would be recalculated for every term-document pair that get_sumpart() is
called on, which is inefficient. How do I calculate the lower bound
for a term only once and then reuse it?
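
One possible shape for this, as a minimal sketch: compute the bound in a
per-term setup step and store it on the object, so that the per-document
call only reads the cached value. It assumes the weighting object gets a
setup call once per term before any documents are scored (Xapian's Weight
subclasses have an init() method that may serve this purpose, but that is
an assumption here, not something confirmed in this thread). The class
name, the placeholder weight log(wdf / doc_len) and its bound are all
hypothetical and are not the DLH formula.

```python
import math


class CachedBoundWeight:
    """Toy weighting object (hypothetical; not Xapian's API or DLH)."""

    def __init__(self, max_doc_len):
        self.max_doc_len = max_doc_len
        self.lower_bound = None
        self.bound_computations = 0  # counts how often the bound is computed

    def init(self):
        # Assumed to be called once per term, before any get_sumpart() call.
        self.lower_bound = self._compute_lower_bound()

    def _compute_lower_bound(self):
        self.bound_computations += 1
        # Placeholder bound: wdf >= 1 and doc_len <= max_doc_len, so
        # log(wdf / doc_len) >= log(1 / max_doc_len), which is <= 0.
        return math.log(1.0 / self.max_doc_len)

    def get_sumpart(self, wdf, doc_len):
        # Per-document work only; the cached bound is reused here.
        raw = math.log(wdf / doc_len)
        return raw - self.lower_bound


weight = CachedBoundWeight(max_doc_len=500)
weight.init()
parts = [weight.get_sumpart(wdf, doc_len)
         for wdf, doc_len in [(1, 500), (3, 120), (10, 40)]]
```

Here the bound is computed once however many documents are scored, and
every value in parts is non-negative because the bound has been
subtracted.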
-Regards
-Aarsh
On Fri, Jun 21, 2013 at 4:41 PM, Olly Betts <olly at survex.com> wrote:
> On Thu, Jun 20, 2013 at 05:10:30PM +0530, Aarsh Shah wrote:
> > Hello guys. I am currently working on the DLH weighting scheme. The
> > formula for DLH is very complex and it ends up giving negative weights
> > to some documents. Because of this, in spite of containing one or more
> > occurrences of the keyword, the documents with negative weights
> > don't show up in the results at all. Please can I get some help on how to
> > deal with this? Or should I just leave it as it is and let the poor
> > documents suffer by virtue of them having statistics not suitable for
> > DLH?
>
> Xapian assumes each component of the weight sum is positive - if you
> return a negative component, the matcher optimisations will go wrong.
>
> If there's a lower bound, you might be able to address this by
> subtracting that bound and adjusting the term-independent component to
> compensate for the terms which don't match each document.
>
> E.g. if the weight contributed by query term t in doc d is W(t,d) and
> Wi(d) is the term independent component, then the weight for document d
> is:
>
> W_sum(d) = Sum{t in d}(W(t,d)) + Wi(d)
>
> If we have a lower bound for W(t,d):
>
> (a) W_low(t) <= W(t,d) for all d
>
> And it's negative (or if the weight for a given term is always >= 0,
> just make this lower bound zero):
>
> (b) W_low(t) <= 0
>
> And similarly for Wi(d):
>
> (c) Wi_low <= Wi(d) for all d
>
> And let's only adjust Wi if we have to:
>
> (d) Wi_low <= 0
>
> Then you can transform your weighting scheme to this one:
>
> W'(t,d) = W(t,d) - W_low(t)
> then: W'(t,d) >= 0 (from (a))
>
> Wi'(d) = Wi(d) - Wi_low - Sum{t not in d}(W_low(t))
> so: Wi'(d) >= Wi(d) - Wi_low (from (b))
> so: Wi'(d) >= 0 (from (c))
>
> And the total weight for document d is:
>
> W_sum'(d) = Sum{t in d}(W'(t,d)) + Wi'(d)
> = Sum{t in d}(W(t,d) - W_low(t)) + Wi(d) - Wi_low - Sum{t not in d}(W_low(t))
> = Sum{t in d}(W(t,d)) + Wi(d) - Wi_low - Sum{t}(W_low(t))
> = W_sum(d) - Wi_low - Sum{t}(W_low(t))
>
> So that's simply added something to every weight which is constant for
> a given query on a given database - the relative ordering of the weights
> is preserved.
>
> Cheers,
> Olly
>
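
The transformation above can be checked numerically. The following is a
small sketch with made-up weights (the documents, terms and numbers are
all hypothetical): it builds W'(t,d) and Wi'(d) from the lower bounds and
compares the shifted totals against the original ones.

```python
# Hypothetical raw per-term weights W(t,d), listed only for matching terms.
W = {
    "d1": {"apple": -0.4, "pie": 1.2},
    "d2": {"apple": 0.9},
    "d3": {"pie": -0.1},
}
# Hypothetical term-independent components Wi(d).
Wi = {"d1": -0.2, "d2": 0.3, "d3": 0.0}
query_terms = ["apple", "pie"]

# Lower bounds clamped to be <= 0, as in conditions (a) to (d).
W_low = {t: min(0.0, min(w[t] for w in W.values() if t in w))
         for t in query_terms}
Wi_low = min(0.0, min(Wi.values()))


def w_sum(d):
    """Original total: Sum{t in d}(W(t,d)) + Wi(d)."""
    return sum(W[d].values()) + Wi[d]


def w_shifted(t, d):
    """W'(t,d) = W(t,d) - W_low(t)."""
    return W[d][t] - W_low[t]


def wi_shifted(d):
    """Wi'(d) = Wi(d) - Wi_low - Sum{t not in d}(W_low(t))."""
    missing = sum(W_low[t] for t in query_terms if t not in W[d])
    return Wi[d] - Wi_low - missing


def w_sum_shifted(d):
    """Shifted total: Sum{t in d}(W'(t,d)) + Wi'(d)."""
    return sum(w_shifted(t, d) for t in W[d]) + wi_shifted(d)


# The constant added to every document's weight for this query.
offset = -Wi_low - sum(W_low.values())

ranking = sorted(W, key=w_sum, reverse=True)
ranking_shifted = sorted(W, key=w_sum_shifted, reverse=True)
```

With these numbers every shifted component is non-negative, each shifted
total differs from the original by the same offset, and the two rankings
are identical.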