[Xapian-devel] Handling Negative value due to logarithm of probabilities.
Gaurav Arora
gauravarora.daiict at gmail.com
Fri Apr 27 05:58:30 BST 2012
Hi,
Continuing the discussion from the Melange comments about the negative values
returned in the matcher due to taking the logarithm of probabilities:
*If we make K suitably large, we could clamp each log(K.Pi) to be >= 0,
and this change will only affect really low probability terms (those with
Pi < 1/K, so you can adjust K to suit):*
*W' = sum(i=1,...,n, max(log(K.Pi), 0))*
Did you mean that for low-probability terms the value returned by log(K.Pi)
would be negative, so we replace any value that is still negative with 0?
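To make sure I read the formula correctly, here is a minimal Python sketch of the clamped sum W'. The function name, the probabilities and the value of K are made up for illustration; this is not Xapian code.

```python
import math

def clamped_weight(term_probs, k):
    """W' = sum over query terms of max(log(K * P_i), 0).

    Assumes every P_i is strictly positive (math.log raises on 0).
    """
    return sum(max(math.log(k * p), 0.0) for p in term_probs)

# Hypothetical per-term probabilities for one document.
probs = [0.02, 0.0005, 0.1]
K = 1000.0

# Only the second term has P_i < 1/K = 0.001, so only it is clamped to 0.
weight = clamped_weight(probs, K)
print(weight)
```

With these numbers the term with Pi = 0.0005 contributes nothing, which is the "rejected from the query" effect I describe next.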
Assigning 0 is equivalent to dropping the term from the query completely,
which hurts retrieval performance in the Language Model, since terms missing
from the document are smoothed with the collection frequency.
I think we should fall back to smoothing from collection statistics when the
document term probability doesn't work (i.e. gives a negative value):
sum(i=1,...,n,
    if max(log(K.Pi), 0) == 0
        max(log(K.Pcollec.i), 0)
    else
        log(K.Pi)
)
If neither works, returning 0 would be the only option.
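The fallback can be sketched in Python as follows. This is only an illustration under my own assumptions: `clamped_log` and `fallback_weight` are hypothetical names, and a probability of 0 is treated the same as a clamped value.

```python
import math

def clamped_log(k, p):
    """max(log(K * p), 0), also returning 0 when p is 0."""
    x = k * p
    return math.log(x) if x > 1.0 else 0.0

def fallback_weight(doc_probs, coll_probs, k):
    """Sum per-term weights, trying the collection probability
    whenever the document probability clamps to 0."""
    total = 0.0
    for p_doc, p_coll in zip(doc_probs, coll_probs):
        w = clamped_log(k, p_doc)
        if w == 0.0:
            # Document probability too low: fall back to the collection.
            w = clamped_log(k, p_coll)
        total += w
    return total

# Hypothetical values: the second term is absent from the document.
print(fallback_weight([0.05, 0.0], [0.01, 0.01], 1000.0))
```

If the collection probability is also below 1/K, the term still contributes 0.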
Moreover, selecting a large enough K would be tricky, as no K is large enough
for every term, since log(x) -> -inf as x -> 0.
Should we choose the value of K statistically? By that I mean running the
unigram weighting scheme on a large collection, observing the lowest
probability that occurs, and approximating K from it. Or is there another
method?
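A rough sketch of that statistical approach, assuming we have already collected the term probabilities seen on a sample run (the list below is invented, and `choose_k` is a hypothetical helper):

```python
import math

def choose_k(observed_probs):
    """Pick K = 1 / p_min so that log(K * p) >= 0 for every observed p.

    Zero probabilities are ignored, since no finite K can rescue them.
    """
    p_min = min(p for p in observed_probs if p > 0.0)
    return 1.0 / p_min

# Hypothetical probabilities observed while weighting a sample collection.
observed = [0.2, 0.0001, 0.03, 0.0]
K = choose_k(observed)

# Up to floating-point rounding, no positive observed probability
# produces a negative log(K * p).
logs = [math.log(K * p) for p in observed if p > 0.0]
print(K, min(logs))
```

A K chosen this way only covers the probabilities that were actually observed; a rarer term in another collection would still go negative.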
I asked the same question on Stack Overflow:
http://goo.gl/ykwN4
They suggested:
*"Could you simple take the negative of the logarithm? Since you are
dealing with probabilities (i.e. values <= 1), the logarithm is always
negative,
so negating it will always make it positive."*
But this approach won't be a good idea, as large values would then indicate
low probabilities and small values high probabilities. Hence the matcher
would tend to drop some good documents from the ranked list because of their
lower weight.
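A small Python illustration of why negation is a problem, using two made-up documents (the names and probabilities are purely hypothetical):

```python
import math

# Hypothetical per-term probabilities: A matches the query well, B poorly.
docs = {"A": [0.1, 0.2], "B": [0.001, 0.002]}

# Usual log-likelihood score: higher means a better match.
log_score = {d: sum(math.log(p) for p in ps) for d, ps in docs.items()}

# Negated score, as suggested on Stack Overflow: always positive.
neg_score = {d: -s for d, s in log_score.items()}

# The matcher ranks by descending weight in both cases.
by_log = sorted(docs, key=lambda d: log_score[d], reverse=True)
by_neg = sorted(docs, key=lambda d: neg_score[d], reverse=True)

print(by_log)  # ['A', 'B']
print(by_neg)  # ['B', 'A']
```

The negated weights are positive, but the ranking is reversed, so the poorly matching document comes out on top.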
Thanks,
-- 
with regards
Gaurav A.